Module IV
Introduction to Correlation and Regression
• Correlation – Meaning; positive, negative and zero correlation; correlation through scatter diagrams; interpretation of the correlation coefficient; simple and multiple correlation; Regression
Correlation
Introduction to Correlation Analysis
• In previous studies, the focus was on univariate analysis, which deals
with one variable at a time. This includes measures like central
tendency, dispersion, skewness, and kurtosis.
• However, there are situations where two variables need to be studied
together, such as height and weight of individuals or sales revenue and
advertising expenditure of a company. This type of analysis, where two
variables are studied together, is known as bivariate analysis. If more
than two variables are studied, it becomes multivariate analysis.
• The goal of correlation analysis is to find if a relationship exists
between two variables and to measure the strength of this relationship.
Definitions and Uses of Correlation
• Croxton and Cowden define correlation as a statistical tool used
to measure and express the relationship between two variables in
a concise formula.
• A.M. Tuttle refers to correlation as the study of how two variables
co-vary (change together).
• W.A. Neiswanger highlights that correlation helps understand
economic behavior and identify important variables.
• Tippett states that correlation reduces uncertainty in prediction
by showing how one variable influences the other.
Types of Correlation
Positive and Negative Correlation:
➢Positive correlation occurs when two variables move in the same direction.
For example, if an increase in one variable leads to an increase in the other,
such as height and weight, or family income and luxury spending.
➢Negative correlation occurs when the variables move in opposite directions.
For example, an increase in the price of a commodity typically leads to a
decrease in demand, or a rise in temperature reduces the sale of woolen
garments.
Linear and Non-Linear Correlation:
➢Linear correlation is when a constant change in one variable leads to a
constant change in the other. For instance, the example where for every
increase of 1 in variable "x," there is a constant increase of 2 in "y."
➢Non-linear (curvilinear) correlation occurs when the change in one
variable does not result in a constant change in the other. Instead, the change
fluctuates, and the graph of such data would not be a straight line.
Correlation and Causation:
➢Correlation measures the degree to which two variables move together but
does not necessarily imply that one variable causes the other.
➢Causation means that changes in one variable directly cause changes in
another. While causation implies correlation, the reverse is not true. High
correlation might exist due to mutual dependence, influence from external
factors, or pure chance (spurious correlation).
Spurious Correlation:
• Sometimes, two variables may show a high degree of correlation
even though they are not related. For example, the size of a
person’s shoe and their intelligence may show correlation in a
small, randomly selected sample, but the relationship is
nonsensical. This is called spurious correlation or nonsense
correlation.
Methods of Studying Correlation:
Several methods are commonly used to study the relationship between
two variables:
• Scatter Diagram Method: A graphical representation where the relationship
between two variables is plotted as points in a graph.
• Karl Pearson’s Coefficient of Correlation: A numerical method that measures
the strength and direction of a linear relationship between two variables. This
method uses covariance to calculate correlation.
• Two-Way Frequency Table (Bivariate Correlation Method): A table that
records the frequency of different combinations of two variables.
• Rank Correlation Method: This method ranks the data for each variable and
then measures the correlation between the ranks.
• Concurrent Deviations Method: A simplified method that examines the
direction of changes in the two variables to determine whether they are
correlated.
Scatter Diagram Method
What is a Scatter Diagram?
• A scatter diagram is a plot of n pairs of values (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
of two variables (e.g., height and weight of individuals).
• Each pair is plotted as a point on a graph, with one variable on the
x-axis and the other on the y-axis.
• The variable plotted on the x-axis is typically the independent
variable, while the variable on the y-axis is the dependent
variable.
How to Interpret a Scatter Diagram?
The scatter diagram helps in visually interpreting the relationship
between two variables:
• Density of Points:
• If the points are close together, it suggests a high correlation between
the two variables.
• If the points are scattered, it indicates a low correlation.
• Trends:
• If the points follow a clear upward trend, it suggests positive correlation,
meaning that as one variable increases, the other also increases.
• If the points show a downward trend, it suggests negative correlation,
meaning that as one variable increases, the other decreases.
• If there is no clear trend, it suggests no correlation between the variables.
• Perfect Correlation:
• Perfect positive correlation: All points lie exactly on a straight line going
from the bottom-left to the top-right.
• Perfect negative correlation: All points lie on a straight line from the
top-left to the bottom-right.
Different Forms of Correlation on
Scatter Diagrams:
Scatter diagrams can take various forms depending on the type of correlation:
• Perfect positive correlation: A straight line from the bottom-left to the top-
right.
• Perfect negative correlation: A straight line from the top-left to the bottom-
right.
• High degree of positive correlation: Points form a dense, upward trend but
may not be perfectly aligned.
• High degree of negative correlation: Points form a dense, downward trend.
• Low degree of positive/negative correlation: Points show an upward or
downward trend but are more spread out.
• No correlation: Points are scattered randomly with no discernible trend.
Advantages and Disadvantages of the
Scatter Diagram Method:
• Advantages:
• Easy to understand: The method is visually intuitive and can give a rough idea
of the nature of the relationship between two variables by simply inspecting the
graph.
• Resistant to extreme observations: Unlike mathematical methods, scatter
diagrams are not significantly influenced by extreme values or outliers.
• Disadvantages:
• Not precise: The scatter diagram only provides a rough idea of whether the
correlation is positive, negative, strong, or weak. It does not give an exact
measure of the correlation.
• Not suitable for large datasets: When there are many data points, the diagram
becomes cluttered and difficult to interpret.
KARL PEARSON’S
COEFFICIENT OF
CORRELATION
(COVARIANCE METHOD)
Karl Pearson's Coefficient of Correlation
• Karl Pearson's Coefficient of Correlation is a widely used
mathematical method to measure the strength and direction of the
linear relationship between two variables, denoted as r. This
coefficient ranges from -1 to +1, where:
• +1 indicates a perfect positive linear correlation,
• -1 indicates a perfect negative linear correlation, and
• 0 indicates no linear relationship between the variables.
Formula Overview
Step 1 – Covariance:
Cov(X, Y) = Σ(X − X̄)(Y − Ȳ) / n
OR
Cov(X, Y) = (ΣXY / n) − X̄Ȳ
In statistics, the variance is the spread of a data set around its mean value, while the covariance is a measure of the directional relationship between two random variables.
Step 2 – Standard deviation:
σx = √[Σ(X − X̄)² / n],  σy = √[Σ(Y − Ȳ)² / n]
Correlation Coefficient
• Pearson’s Correlation Coefficient (r): After calculating the
covariance and the standard deviations, Pearson’s coefficient is:
r = Cov(X, Y) / (σx σy)
• Simplified Formula:
r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²] [nΣY² − (ΣY)²]}
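The two steps and the final ratio can be sketched in Python; the data below is hypothetical, chosen only to illustrate the computation:

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's r via the covariance method: r = Cov(X, Y) / (sd_x * sd_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sd_x = sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sd_y = sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return cov / (sd_x * sd_y)

# Hypothetical data: Y rises by exactly 2 for every unit rise in X,
# so r should come out as (approximately) +1.
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```

A perfectly linear upward series gives r close to +1; a perfectly reversed series gives r close to −1.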
Illustration 1
Illustration 2
Determine the coefficient of correlation for the following data
Illustration 3
Determine the coefficient of correlation for the following data:
Illustration 4
Determine the coefficient of correlation for the following data:
X: 3  2  1  5  4
Y: 8  4  10  2  6
Illustration 5
• Find Karl Pearson's coefficient of correlation from the following
index numbers and interpret it.
Wages (₹) 100 101 103 102 100 99 97 98 96 95
Cost of living 98 99 99 97 95 92 95 94 90 91
Spearman’s Rank
Correlation Coefficient (ρ)
Spearman’s Rank Correlation Coefficient (ρ)
• Rank Correlation Method is used when the variables under
study cannot be measured quantitatively but can be ranked based
on qualitative attributes like intelligence, beauty, or honesty.
• This method is especially helpful when we want to see if two sets
of rankings are related.
• Spearman’s Rank Correlation Coefficient (ρ)
• Spearman's Rank Correlation Coefficient, denoted by ρ(rho),
measures the correlation between the ranks of two variables.
Formula
ρ = 1 − [6Σd²] / [n(n² − 1)]
where d is the difference between the ranks of corresponding values of the two variables, and n is the number of pairs of observations.
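A minimal Python sketch of Spearman's formula, assuming no tied ranks (the convention of giving rank 1 to the largest value is an assumption made here for illustration):

```python
def spearman_rho(x, y):
    """Spearman's rank correlation: rho = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)

    def ranks(values):
        # Rank 1 for the largest value; assumes no ties.
        ordered = sorted(values, reverse=True)
        return [ordered.index(v) + 1 for v in values]

    rx, ry = ranks(x), ranks(y)
    d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d_sq) / (n * (n ** 2 - 1))

# Identical orderings give rho = 1; exactly opposite orderings give rho = -1.
print(spearman_rho([10, 20, 30, 40], [1, 2, 3, 4]))
```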
Illustration 6
Illustration 7
• Solution. Let X denote the advertisement cost (’000 Rs.) and Y denote
the sales (lakhs Rs.)
Illustration 8
Linear Regression Analysis
Meaning
• The term regression in statistics refers to a method used to study
the relationship between two or more variables.
• It allows us to predict the value of one variable (the dependent
variable) based on the known value of another variable (the
independent variable). Essentially, it helps estimate how one
variable changes as another variable changes.
Regression answers questions like:
• How does sales change with an increase in advertising spending?
• How does income change with years of education?
Types of Regression
• Simple regression: Involves two variables — one dependent and
one independent variable.
• Multiple regression: Studies the impact of multiple independent
variables on a dependent variable.
More about regression:
• Dependent variable (Y): The variable you are trying to predict.
• Independent variable (X): The variable used to make
predictions.
• For example, in linear regression, the relationship between the
variables is represented by a straight line. The formula typically
used is:
Y = a + bX
Linear and Non-Linear Regression
• Linear regression occurs when the relationship between two
variables is represented by a straight line. The equation of this line
is given by Y = a + bX, where a is the intercept and b is the slope.
• Non-linear regression involves more complex relationships
between variables, where the equation includes higher-degree
terms like Y = a + bX + cX².
Lines of Regression
Regression Equation of Y on X
Y − Ȳ = byx (X − X̄), where byx = r (σy / σx)
OR
byx = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
Regression Equation of X on Y
X − X̄ = bxy (Y − Ȳ), where bxy = r (σx / σy)
OR
bxy = Σ(X − X̄)(Y − Ȳ) / Σ(Y − Ȳ)²
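The two regression coefficients can be computed directly from the deviation sums; a Python sketch with hypothetical data:

```python
def regression_coefficients(x, y):
    """Return (byx, bxy): slopes of the Y-on-X and X-on-Y regression lines."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    s_xx = sum((xi - mx) ** 2 for xi in x)
    s_yy = sum((yi - my) ** 2 for yi in y)
    byx = s_xy / s_xx  # slope for predicting Y from X
    bxy = s_xy / s_yy  # slope for predicting X from Y
    return byx, bxy

# Hypothetical data with Y = 2X exactly: byx = 2.0 and bxy = 0.5,
# and the product byx * bxy equals r^2 (here 1).
byx, bxy = regression_coefficients([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(byx, bxy)
```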
Example Scenario
• Imagine you want to analyze the relationship between hours studied
and scores obtained in an exam. You collect the following data:
Regression Line
The regression line is a mathematical model that predicts the
dependent variable (exam score) based on the independent
variable (hours studied). The equation of the regression line can
be expressed as:
Score = a + b × (Hours studied)
Illustration 9
• From the following data find the two regression equations
Illustration 10
Coefficient of determination
• The coefficient of determination, denoted as R², is a statistical
measure that indicates how well a regression model fits the data.
• It represents the proportion of the variance in the dependent
variable that is predictable from the independent variable(s).
• In simpler terms, it tells us how well the independent variables
explain the variability of the dependent variable.
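R² can be computed from the residual and total sums of squares; a minimal sketch with hypothetical values:

```python
def r_squared(y_actual, y_predicted):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_actual) / len(y_actual)
    ss_tot = sum((y - mean_y) ** 2 for y in y_actual)  # total variation
    ss_res = sum((y - yp) ** 2 for y, yp in zip(y_actual, y_predicted))  # unexplained
    return 1 - ss_res / ss_tot

# Perfect predictions explain all the variation (R^2 = 1);
# predicting the mean every time explains none of it (R^2 = 0).
print(r_squared([1, 2, 3], [1, 2, 3]), r_squared([1, 2, 3], [2, 2, 2]))
```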
Illustration 11
• Compute the appropriate regression equation for the following data.
Illustration 12
Using Regression for Prediction
With these models estimate:
i) the value of sales when the company decides to spend Rs. 2,50,000 on advertising, and
ii) the cost of advertisement when the company desires to reach the target of Rs. 50 lakhs during the next quarter.
Answer:
1) If the company’s marketing department decides to spend Rs. 2,50,000 (X = 2.5) on advertisement during the next quarter, the most likely estimate of sales
= 16.15 + 5.8 (2.5) = 30.65, i.e. Rs. 30,65,000.
2) If the company desires to reach the target of Rs. 50 lakhs of sales during the next quarter, the most likely estimate of advertisement cost
= −0.25 + 0.093 (50) = −0.25 + 4.65 = 4.4, i.e. Rs. 4,40,000.
Also called Least Squares Regression
• Least Squares Regression is a statistical technique that aims to find the
line of best fit through a set of data points by minimizing the sum of the
squares of the vertical distances (the residuals) between the observed
data points and the points on the fitted line.
How It Works:
• The formulas for byx and bxy (the regression coefficients) minimize the
error (the residuals) in the vertical direction for one variable (like Y)
when trying to predict it from another variable (like X).
• The idea behind least squares is to ensure that the total squared
difference between the actual values and the predicted values (on the
regression line) is as small as possible.
• Y on X regression finds the best-fitting line to predict Y from X.
• X on Y regression finds the best-fitting line to predict X from Y.
• In both cases, we are using least squares to minimize the sum of the
squared errors between the observed values and the values
predicted by the regression line.
• In each case the goal is to minimize the difference between the actual
data points and the line of best fit.
Standard Error (SE)
• The standard error (SE) is a measure that describes how much
sample means differ from the actual population mean.
• It provides insight into the precision of the sample mean as an
estimate of the population mean.
• In other words, it tells us how much we might expect the sample
mean to vary if we were to take multiple samples from the same
population.
Formula
• The formula for the standard error of the mean (SEM) is:
SE = σ / √N
where:
• σ is the standard deviation of the population,
• N is the sample size.
• If the population standard deviation (σ) is unknown, the sample
standard deviation (s) is often used instead, especially in small
samples.
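The SEM formula above amounts to a one-line computation; a Python sketch with illustrative numbers:

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of the mean: SE = sigma / sqrt(N)."""
    return sigma / sqrt(n)

# With a population standard deviation of 10 and a sample of 25,
# SE = 10 / 5 = 2.0.
print(standard_error(10, 25))
```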
Index numbers
Index numbers
• Index numbers are tools that help us understand how certain
things (like prices or production levels) change over time.
• They allow us to compare changes in a specific period (called the
"current period") with a past period (called the "base period").
• The things we measure can be prices of items (like gold, steel, or
milk), production levels (like how much a factory produces), or
even broader topics like national income or the cost of living.
• The main types of index numbers include:
1.Price Index: Measures changes in the price level of goods and
services over time. Examples include the Consumer Price Index
(CPI) and the Producer Price Index (PPI).
2.Quantity Index: Tracks changes in the quantity or output levels,
commonly used to measure production volume in industries.
3.Value Index: Measures changes in the total value of transactions,
combining both price and quantity indices, which can be
particularly useful for analyzing economic growth.
• Key Points:
1.Relative Changes: Index numbers show how much something has
increased or decreased compared to a past time.
2.Different Variables: They can measure changes in prices,
production, wages, or economic factors over time.
3.Simplifying Complex Changes: Since different items (like rice,
milk, or fuel) have prices measured in different units (kilograms,
liters, etc.), it’s hard to compare them directly. Index numbers
provide a single number that summarizes the overall change,
making it easier to understand.
Example
• If you want to know how prices of everyday goods (like food, fuel,
and clothing) have changed over the last five years for low-income
families, you can't just look at one or two prices because some may
have gone up while others went down. An index number helps you
get a general idea of how prices have changed overall.
• In simpler terms, index numbers give us a snapshot of change in
various things over time, helping us track and compare these
changes in a straightforward way.
METHODS OF CONSTRUCTING INDEX
NUMBERS
Simple Aggregate Method
Under the simple aggregate method, the price index is the total of current-period prices expressed as a percentage of the total of base-period prices:
P01 = (Σp1 / Σp0) × 100
Fisher's Ideal Index
• Fisher's Index, also known as Fisher's Ideal Index, is a type of
index number that combines two other popular indices: the
Laspeyres Index and the Paasche Index.
• It provides a balanced or "ideal" measurement of price changes by
taking the geometric mean of these two indices.
Formula:
F01 = √(L × P) = √[ (Σp1q0 / Σp0q0) × (Σp1q1 / Σp0q1) ] × 100
where L is the Laspeyres index and P is the Paasche index.
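The geometric mean of the Laspeyres and Paasche indices can be sketched in Python; the price and quantity lists below are hypothetical:

```python
from math import sqrt

def fisher_index(p0, q0, p1, q1):
    """Fisher's Ideal Index: geometric mean of Laspeyres and Paasche (base = 100)."""
    laspeyres = 100 * sum(p * q for p, q in zip(p1, q0)) / sum(p * q for p, q in zip(p0, q0))
    paasche = 100 * sum(p * q for p, q in zip(p1, q1)) / sum(p * q for p, q in zip(p0, q1))
    return sqrt(laspeyres * paasche)

# If every price doubles while quantities stay put, Laspeyres, Paasche
# and Fisher all agree at 200.
print(fisher_index([1, 2], [1, 1], [2, 4], [1, 1]))
```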
Illustration 12
On the basis of the following information, calculate Fisher's
index number:
Solution
Illustration 13
Construct Fisher’s Ideal Index Number using the data
given below.
Time Reversal Test and Factor Reversal
Test
Time Reversal Test and Factor Reversal Test are two important consistency checks for
index numbers, particularly when measuring price and quantity changes over time.
Time Reversal Test
• The Time Reversal Test checks whether the index number remains consistent if we
reverse the time periods.
• In simple terms, if you calculate an index from the base period to the current period
and then reverse it (from the current back to the base), the product of these two
indexes should equal 1 (or 100 if we express it as a percentage).
For example:
• If the price index from period 0 to period 1 is 120, the index from period 1 back to
period 0 should ideally be 100 / 120 = 0.8333 (or 83.33).
• The Time Reversal Test is passed if the multiplication of these two indexes results in
1.
• This test helps confirm that the index number does not depend on the direction of
time.
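Fisher's Ideal Index is known to satisfy the Time Reversal Test, so the check can be verified numerically; the data values in this Python sketch are arbitrary, chosen only for illustration:

```python
from math import sqrt

def fisher_ratio(p0, q0, p1, q1):
    """Fisher's index as a pure ratio (no x100 scaling)."""
    laspeyres = sum(p * q for p, q in zip(p1, q0)) / sum(p * q for p, q in zip(p0, q0))
    paasche = sum(p * q for p, q in zip(p1, q1)) / sum(p * q for p, q in zip(p0, q1))
    return sqrt(laspeyres * paasche)

def passes_time_reversal(p0, q0, p1, q1):
    """P01 x P10 should equal 1 when the two time periods are interchanged."""
    forward = fisher_ratio(p0, q0, p1, q1)
    backward = fisher_ratio(p1, q1, p0, q0)
    return abs(forward * backward - 1.0) < 1e-9

print(passes_time_reversal([2, 3], [4, 5], [3, 4], [5, 6]))  # True
```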
Factor Reversal Test
• The Factor Reversal Test checks whether an index number
gives consistent results when we swap the roles of prices and
quantities (factors) while calculating value changes.
• In other words, the product of a price index and a quantity
index should equal the value index (which measures total
monetary value).
• According to the Factor Reversal Test, P01 × Q01 should give the value
index V01 = Σp1q1 / Σp0q0, which reflects the overall change in value
over time.
• This test ensures that both price and quantity changes are
accurately captured together in a combined value measure.
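The factor reversal property can be checked numerically the same way: swapping the roles of prices and quantities in Fisher's formula yields the quantity index, and price × quantity should reproduce the value ratio. The data here is again hypothetical:

```python
from math import sqrt

def fisher_ratio(p0, q0, p1, q1):
    """Fisher's price index as a pure ratio (no x100 scaling)."""
    laspeyres = sum(a * b for a, b in zip(p1, q0)) / sum(a * b for a, b in zip(p0, q0))
    paasche = sum(a * b for a, b in zip(p1, q1)) / sum(a * b for a, b in zip(p0, q1))
    return sqrt(laspeyres * paasche)

def passes_factor_reversal(p0, q0, p1, q1):
    """P01 x Q01 should equal the value index V01 = sum(p1*q1) / sum(p0*q0)."""
    price_index = fisher_ratio(p0, q0, p1, q1)
    quantity_index = fisher_ratio(q0, p0, q1, p1)  # roles of p and q swapped
    value_index = sum(a * b for a, b in zip(p1, q1)) / sum(a * b for a, b in zip(p0, q0))
    return abs(price_index * quantity_index - value_index) < 1e-9

print(passes_factor_reversal([2, 3], [4, 5], [3, 4], [5, 6]))  # True
```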
University questions
Rank correlation
Correlation versus covariance
Coefficient of determination
Standard error
What are index numbers? Explain the concept and uses.
Explain the regression analysis. What is the basic form of a regression
equation? Interpret the terms in the equation.
What do the dependent and independent variables indicate in simple
regression? Explain with the help of an example.