Correlation & Regression
Dr. Anand Tiwari
Department of Chemical Engineering
Dharmsinh Desai University
Content
Topics COs
1. Data Definitions and Working with data CO1
Qualitative and Quantitative data, Discrete and Continuous Data, Frequency Measures,
Graphical Representation Collection and summarizing data
2. Probability and Distribution CO2
Probability Distribution, Random variables and sampling, Normal Distribution,
3. Statistical Inference CO3
Central Tendencies, Confidence intervals, test of hypothesis
4. Regression and Correlation CO4
Single and multiple linear regression, Non-linear regression, relation between correlation and
regression, Goodness of fitting
5. Analysis of Variance – ANOVA CO5
Logic behind ANOVA, Single Factor and Two factor ANOVA
Experimental Data
Overview
Error analysis and propagation
Data Correlation
Data Regression
Sources of Errors
• Sources of errors can be classified in 3 broad groups –
• Errors of Measurement
• Errors of measurement are due to physical limitations of scaling. Without a Vernier
attachment, a length of 2 cm could easily be in error by 0.5 mm, which is 2.5%.
• Precision Errors
• Precision errors are the built-in errors of the apparatus. For example, the scale on
a mercury-in-glass thermometer, a burette, or a pipette assumes that the bore is uniform,
i.e. that a change of 1 unit corresponds to the same increment everywhere along the scale.
• Errors of Method
• Errors of method include faults such as neglecting heat losses, assuming constant
overflow, or neglecting back-mixing in a tubular reactor etc.
• All these errors can only be estimated and rarely measured
• An error is usually given up to only one or two significant figures.
Terminologies
• Accuracy, Error, Precision, and Uncertainty
• All measured quantities are subject to uncertainties
• The variability in the results of repeated measurements arises
because the absolute values are impossible to reproduce
• The result of any physical measurement has
two essential components:
• numerical value (in a specified system of units) giving the best
estimate possible of the quantity measured
• degree of uncertainty associated with this estimated value
• Errors can be thought of as issues with equipment or methodology that
cause a reading to be different from the true value
• The uncertainty is a range of values around a measurement within
which the true value is expected to lie, and is an estimate
• The true value cannot be determined absolutely. Trueness is the
closeness of agreement between the average value obtained
from a large series of test results and the accepted true value. Trueness
is largely affected by systematic error.
• Accuracy is the closeness of agreement between a measured
value and the true value. Accuracy is an expression of the lack
of error.
• Precision is the closeness of agreement between independent
measurements of a quantity under the same conditions.
Precision is largely affected by random error.
• Uncertainty characterizes the range of values within which the
true value is asserted to lie with some level of confidence. For
example, a measured concentration C = 2.6 ± 0.15 mol/L
• Error is the difference between the true value and the measured
value. The total error is a combination of both systematic error
and random error.
Errors
• Experimental data are subject to various errors, so any calculation based on
them has limited accuracy
• The way in which errors accumulate/propagate is governed by the
operations being performed (addition, subtraction, multiplication or division)
• Absolute error: the numerical difference between the true value and the
approximated/estimated value
• If x′ is the experimental reading of a quantity whose true value is x, then:
■ Absolute error: $\varepsilon_A = \delta x = x - x'$
■ Absolute accuracy: $\Delta x = |x - x'|$
■ Relative error: $\varepsilon_R = \dfrac{\delta x}{x'}$
■ Relative accuracy: $\dfrac{\Delta x}{x} \approx \dfrac{\Delta x}{x'}$
■ Percentage error: $\varepsilon_P = 100\,\varepsilon_R$
Error Propagation
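■ Suppose z is estimated from two measured variables x and y having uncertainties
±Δx and ±Δy
■ Propagation through addition/subtraction, $z = x \pm y$:
$\Delta z = \sqrt{(\Delta x)^2 + (\Delta y)^2}$
■ Propagation through multiplication/division, $z = xy$ or $z = x/y$:
$\dfrac{\Delta z}{z} = \sqrt{\left(\dfrac{\Delta x}{x}\right)^2 + \left(\dfrac{\Delta y}{y}\right)^2}$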
Error Propagation
■ Propagation through exponents, $z = x^m y^n$:
$\dfrac{\Delta z}{z} = \sqrt{\left(m\dfrac{\Delta x}{x}\right)^2 + \left(n\dfrac{\Delta y}{y}\right)^2}$
■ Example: Suppose you want to estimate the specific heat of an iron rod from the
measured values below. What would be the uncertainty in the estimated
specific heat?
$q = m C_p (T_2 - T_1)$ and $q = V I t$, so $C_p = \dfrac{q}{m (T_2 - T_1)}$

Measured quantity   Value   Unit   Uncertainty
T2                  45.6    °C     0.1
T1                  40.4    °C     0.1
(T2 − T1)                   °C
I                   10.2    A      0.02
V                   13.5    V      0.1
t                   125     s      1
m                   0.72    kg     0.01
d                   3.38    cm     0.03
h                   10.2    cm     0.03
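The example can be worked through numerically. The sketch below is one possible reading of the exercise, assuming the uncertainties are independent and combine in quadrature as in the propagation rules above. The variable names are illustrative, and d and h are left out because they do not appear in the formula for Cp.

```python
import math

# Measured values and their uncertainties, taken from the table above.
V, dV = 13.5, 0.1      # voltage, V
I, dI = 10.2, 0.02     # current, A
t, dt = 125.0, 1.0     # time, s
m, dm = 0.72, 0.01     # mass, kg
T2, dT2 = 45.6, 0.1    # final temperature, deg C
T1, dT1 = 40.4, 0.1    # initial temperature, deg C

# Subtraction rule: absolute uncertainties add in quadrature.
delta_T = T2 - T1
d_delta_T = math.sqrt(dT1**2 + dT2**2)

# Heat supplied and specific heat: q = V*I*t, Cp = q / (m * (T2 - T1)).
q = V * I * t
cp = q / (m * delta_T)

# Multiplication/division rule: relative uncertainties add in quadrature.
rel_cp = math.sqrt(
    (dV / V) ** 2
    + (dI / I) ** 2
    + (dt / t) ** 2
    + (dm / m) ** 2
    + (d_delta_T / delta_T) ** 2
)
d_cp = cp * rel_cp

print(f"T2 - T1 = {delta_T:.2f} +/- {d_delta_T:.2f} deg C")
print(f"Cp = {cp:.0f} +/- {d_cp:.0f} J/(kg.degC)  (relative {100 * rel_cp:.1f} %)")
```

With these numbers the temperature difference dominates the result: its relative uncertainty is larger than that of any other measured quantity.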
Coefficient of Correlation
• Coefficient of correlation (r) – It is a measure of the strength of the
linear relationship between two variables x and y in the sample.
• The correlation coefficient r is scaleless. The value of r always lies
between -1 and +1, no matter what the units of x and y are.
• This is also called Pearson correlation coefficient
• High correlation does not imply causality. If a large positive or
negative value of the sample correlation coefficient r is observed, it is
incorrect to conclude that a change in x causes a change in y.
• A value of r near or equal to 0 implies little or no linear
relationship between y and x.
• The closer r is to +1 or -1, the stronger the linear relationship
between y and x. If r = +1 or -1, all the points fall exactly on
the least-squares line.
• Positive values of r imply that y increases as x increases;
negative values imply that y decreases as x increases.
Assumptions
Data must meet the following criteria to use Pearson's correlation:
• Both variables are on an interval or ratio level of measurement
• Data from both variables follow normal distributions
• Your data have no outliers
• Your data is from a random or representative sample
• You expect a linear relationship between the two variables
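$y_i = a + b x_i + \varepsilon_i$, with residual $\varepsilon_i = y_i - \hat{y}_i$
Minimize $SSE = \sum_{i=1}^{n} \left[ y_i - (a + b x_i) \right]^2$
Equate the partial derivatives $\partial SSE/\partial a$ and $\partial SSE/\partial b$ to zero and solve for a and b.
This gives $b = SS_{xy}/SS_{xx}$ and $a = \bar{y} - b\bar{x}$, where
$SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \dfrac{1}{n} \sum x_i \sum y_i$
$SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \dfrac{1}{n} \left( \sum x_i \right)^2$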
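• Compression vs. pressure for a material used in a pressure vessel. Calculate the coefficient of correlation.

Pressure x (kg/m2)   Compression y (mm)
1                    25.4
2                    26.2
3                    51
4                    53.6
5                    101.2

$SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \dfrac{1}{n} \sum x_i \sum y_i$
$SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \dfrac{1}{n} \left( \sum x_i \right)^2$
$SS_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \dfrac{1}{n} \left( \sum y_i \right)^2$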
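A short numerical check of this example is sketched below. It assumes the standard definition of the Pearson coefficient, $r = SS_{xy} / \sqrt{SS_{xx}\,SS_{yy}}$, and uses the computational (sum-based) forms of the three sums of squares; the variable names are illustrative.

```python
import math

# Pressure (x) and compression (y) data from the table above.
x = [1, 2, 3, 4, 5]
y = [25.4, 26.2, 51.0, 53.6, 101.2]
n = len(x)

# Sums of squares in their computational (shortcut) form.
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
ss_yy = sum(yi * yi for yi in y) - sum(y) ** 2 / n

# Pearson correlation coefficient.
r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(f"SSxy = {ss_xy:.3f}, SSxx = {ss_xx:.3f}, SSyy = {ss_yy:.3f}")
print(f"r = {r:.3f}")  # r is about 0.919 for these data
```

The value of r close to +1 indicates a strong positive linear relationship between pressure and compression in this sample.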
Coefficient of Determination
• A way to measure the contribution of x in predicting y is to consider
how much the errors of prediction of y can be reduced by using the
information provided by x.
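• If x contributes little or no information for the prediction of y, then the
sums of squares of deviations for the two lines (the horizontal line $\hat{y} = \bar{y}$ and the
least-squares line) will be nearly equal, i.e. $SS_{yy} \approx SSE$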
• If x does contribute information for the prediction of y, then SSE will
be smaller than 𝑆𝑆𝑦𝑦 . In fact, if all the points fall on the least-squares
line, then SSE = 0
• It is denoted r² and serves as a measure of goodness of fit or prediction:
$r^2 = \dfrac{SS_{yy} - SSE}{SS_{yy}} = 1 - \dfrac{SSE}{SS_{yy}}$
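The comparison between SSyy and SSE can be illustrated with the pressure and compression data from the earlier example. The sketch below assumes the usual definition $r^2 = 1 - SSE/SS_{yy}$ and the least-squares estimates $b = SS_{xy}/SS_{xx}$ and $a = \bar{y} - b\bar{x}$; the variable names are illustrative.

```python
# Pressure (x) and compression (y) data from the correlation example.
x = [1, 2, 3, 4, 5]
y = [25.4, 26.2, 51.0, 53.6, 101.2]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
b = ss_xy / ss_xx
a = y_bar - b * x_bar

# SSyy: squared deviations about the mean (prediction without x).
ss_yy = sum((yi - y_bar) ** 2 for yi in y)
# SSE: squared deviations about the fitted line (prediction with x).
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / ss_yy
print(f"SSyy = {ss_yy:.3f}, SSE = {sse:.3f}, r^2 = {r_squared:.3f}")
```

For these data SSE is much smaller than SSyy, so using x removes most of the error in predicting y (r² is about 0.84).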
Spearman Rank Correlation
• Most common alternative to Pearson’s r. It uses the rankings of data
from each variable (e.g., from lowest to highest) rather than the raw
data itself.
• It is used when the assumptions of Pearson's correlation are not met,
i.e. when the distribution is not normal or non-linearity is present.
$r_s = 1 - \dfrac{6 \sum d_i^2}{n^3 - n}$
• $r_s$ = strength of the rank correlation between variables
• $d_i$ = the difference between the x-variable rank and the y-variable rank
for each pair of data
• $\sum d_i^2$ = sum of the squared differences between x- and y-variable ranks
• n = sample size
• Work-Hours Missed, Annual Wages, for 15 Employees
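The rank formula can be sketched in a few lines. The data below are made up for illustration and are not the employee data set; the helper names `ranks` and `spearman_rs` are hypothetical, and the sketch assumes there are no tied values (ties would need averaged ranks).

```python
def ranks(values):
    """Return the rank (1 = lowest) of each value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rs(x, y):
    """Spearman rank correlation via r_s = 1 - 6*sum(d_i^2) / (n^3 - n)."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n**3 - n)


# Illustrative (made-up) paired observations.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]

print(ranks(y))                      # [1, 3, 2, 5, 4]
print(round(spearman_rs(x, y), 3))   # 0.8
```

Here the rank differences are 0, -1, 1, -1, 1, so the sum of squared differences is 4 and r_s = 1 - 24/120 = 0.8.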
Data Regression – Curve Fitting
• Provides the method to represent the set of experimental data in the
form of empirical equations.
• Linear regression is applied to any function that is linear in the coefficients
• Non-linear regression is applied to any function that is non-linear in the
coefficients
• Different methods for regression are –
• Analytical Method
• Method of Averages
• Method of Least Squares
Method of Averages
• Suppose nine experimental values of a variable Z are available at
nine different known values of x, and the best-fit curve of the following type is required:
$Z = A + Bx + Ce^{x}$
■ Number of parameters – 3
■ Number of equations needed – 3
• For method of averages, the following points must be observed:
1. Points must be arranged in ascending values of independent variable x.
2. The number of groups must equal the number of unknown parameters.
3. Groups should contain approximately equal number of points:
• 9 points → 3 points in each group
• 8 points → 3 points + 2 points + 3 points
• 10 points → 3 points + 4 points + 3 points
4. Each experimental point should be used only once.
5. The appropriate average must be taken.
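The procedure above can be sketched for the curve Z = A + Bx + C·exp(x). Everything in the block is illustrative: the nine data points are synthetic (generated from known A, B, C so the result can be checked), the helper names are hypothetical, and the grouping assumes the number of points is an exact multiple of the number of groups.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def method_of_averages(xs, zs, n_groups=3):
    """Fit Z = A + B*x + C*exp(x) by averaging the equation over each group."""
    points = sorted(zip(xs, zs))       # step 1: ascending x
    size = len(points) // n_groups     # equal group sizes assumed
    lhs, rhs = [], []
    for g in range(n_groups):          # step 2: one group per parameter
        group = points[g * size:(g + 1) * size]
        k = len(group)
        # Averaged equation: mean(Z) = A + B*mean(x) + C*mean(exp(x))
        lhs.append([1.0,
                    sum(x for x, _ in group) / k,
                    sum(math.exp(x) for x, _ in group) / k])
        rhs.append(sum(z for _, z in group) / k)
    return solve_linear(lhs, rhs)      # [A, B, C]


# Synthetic, noise-free data from A = 2, B = -1, C = 0.5 (illustration only).
xs = [0.25 * i for i in range(9)]
zs = [2.0 - 1.0 * x + 0.5 * math.exp(x) for x in xs]

A_fit, B_fit, C_fit = method_of_averages(xs, zs)
print(A_fit, B_fit, C_fit)
```

With noise-free data the method returns the generating parameters; with real measurements the three averaged equations only approximate the best fit, which is why the result is usually compared with least squares.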
Method of least squares
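Fit $y = mx + c$. For each point the residual is $R = y - mx - c$, and the quantity to minimize is
$z = \sum (y - mx - c)^2 = \sum R^2$
Setting $\dfrac{\partial z}{\partial m} = 0$ and $\dfrac{\partial z}{\partial c} = 0$ gives
$m = \dfrac{N \sum x_n y_n - \sum x_n \sum y_n}{N \sum x_n^2 - \left( \sum x_n \right)^2}$
$c = \dfrac{\sum x_n^2 \sum y_n - \sum x_n y_n \sum x_n}{N \sum x_n^2 - \left( \sum x_n \right)^2}$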
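The two closed-form expressions for m and c translate directly into code. The sketch below applies them to the pressure and compression data from the correlation example; the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x**2   # zero only if all x values are equal
    m = (n * sum_xy - sum_x * sum_y) / denom
    c = (sum_x2 * sum_y - sum_xy * sum_x) / denom
    return m, c


# Pressure (x) and compression (y) data from the correlation example.
pressure = [1, 2, 3, 4, 5]
compression = [25.4, 26.2, 51.0, 53.6, 101.2]

m, c = fit_line(pressure, compression)
print(f"y = {m:.2f} x + ({c:.2f})")  # y = 17.90 x + (-2.22)
```

The same slope follows from SSxy/SSxx = 179.0/10 = 17.9, so the two formulations of least squares agree.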
Linear regression
■ Used for any function that is linear in the coefficients; the function itself need not be linear in x, e.g.
$y = a + bx + cx^2 + dx^3$
$y = a e^{x}$
■ The least-squares method gives the best fit by minimizing the sum of the squares of the y
deviations of individual points from the fitted line
■ Correlation coefficient, R, is used as a measure of the correlation between x and y,
where $-1 \le R \le 1$
■ R², ranging from 0 to 1, is used as a measure of goodness of fit. A value above
0.9 is usually regarded as a good fit.
■ Multiple linear regression fits data to a model that defines y as a function of two or
more independent x variables
𝑦 = 𝑎𝑇 + 𝑏𝑃 + 𝑐𝑡 + 𝑑
Non-linear Regression
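■ Some models are non-linear in the coefficients but can be linearized, e.g. $y = a e^{\pm bx}$ becomes $\ln y = \ln a \pm bx$
■ Equations that cannot be converted into a linear form are said to be
intrinsically nonlinear
■ Chemical kinetics: a system of two consecutive first-order reactions, $A \xrightarrow{k_1} B \xrightarrow{k_2} C$
$C_B = C_{A0} \dfrac{k_1}{k_2 - k_1} \left( e^{-k_1 t} - e^{-k_2 t} \right)$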
■ There are no analytical expressions to obtain the set of regression
coefficients for a fitting function that is nonlinear in its coefficients
■ The approach is to guess values of the parameters and search for the set that
matches the measured data most closely.
■ The goal is to minimize the sum of squared errors
$SSE = \sum \left( C_{B,\mathrm{calc}} - C_{B,\mathrm{meas}} \right)^2$
■ By trial and error, the parameter guesses are refined until SSE is at a
minimum
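The trial-and-error idea can be sketched as a coarse grid search over k1 and k2 for the consecutive-reaction model. This is only an illustration under stated assumptions: the "measured" concentrations are synthetic (generated from k1 = 0.5, k2 = 0.2 with CA0 = 1), the grid limits and spacing are arbitrary, and the function names are hypothetical. Practical work would normally use an optimizer rather than an exhaustive grid.

```python
import math


def cb_model(t, k1, k2, ca0=1.0):
    """Concentration of B for A -> B -> C (first-order steps, k1 != k2)."""
    return ca0 * k1 / (k2 - k1) * (math.exp(-k1 * t) - math.exp(-k2 * t))


def sse(k1, k2, times, cb_measured):
    """Sum of squared errors between calculated and measured CB."""
    return sum((cb_model(t, k1, k2) - cb) ** 2
               for t, cb in zip(times, cb_measured))


def grid_search(times, cb_measured, steps=20, k_max=1.0):
    """Try every (k1, k2) pair on a regular grid and keep the lowest SSE."""
    best = None
    for i in range(1, steps + 1):
        for j in range(1, steps + 1):
            if i == j:
                continue  # the model is undefined for k1 == k2
            k1 = i * k_max / steps
            k2 = j * k_max / steps
            error = sse(k1, k2, times, cb_measured)
            if best is None or error < best[0]:
                best = (error, k1, k2)
    return best


# Synthetic "measurements" generated from k1 = 0.5, k2 = 0.2 (illustration only).
times = [float(t) for t in range(11)]
cb_measured = [cb_model(t, 0.5, 0.2) for t in times]

best_sse, k1_fit, k2_fit = grid_search(times, cb_measured)
print(f"k1 = {k1_fit:.2f}, k2 = {k2_fit:.2f}, SSE = {best_sse:.3g}")
```

Because the synthetic data lie exactly on a grid point, the search recovers the generating rate constants; with noisy data it would return the nearest grid point, which can then serve as a starting guess for a finer search.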
Problem-1
Fit the data below to $Nu = a\,Re^{m}\,Pr^{n}$ and find the parameters a, m and n.
Use both the method of averages and the method of least squares.

Nu     Re      Pr
24.8   7000    0.46
28.5   7600    0.63
60.3   12000   4.2
58.4   11700   3
84.5   14300   10
115    17000   18.6
170    19000   41
193    20100   58.5
140    17900   32
189    19700   70.3
315    25000   185
480    29300   590
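One way to approach the least-squares part is to linearize the model: taking logarithms gives ln Nu = ln a + m ln Re + n ln Pr, which is linear in ln a, m and n. The sketch below sets up and solves the resulting normal equations. It is a possible starting point rather than a reference solution; the function names are hypothetical, and fitting in log space minimizes the error in ln Nu, not in Nu itself.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_power_law(nu, re, pr):
    """Least-squares fit of ln Nu = ln a + m ln Re + n ln Pr; returns (a, m, n)."""
    rows = [[1.0, math.log(r), math.log(p)] for r, p in zip(re, pr)]
    target = [math.log(v) for v in nu]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(rows, target)) for i in range(3)]
    ln_a, m, n = solve_linear(xtx, xty)
    return math.exp(ln_a), m, n


# Data from the table above.
Nu = [24.8, 28.5, 60.3, 58.4, 84.5, 115, 170, 193, 140, 189, 315, 480]
Re = [7000, 7600, 12000, 11700, 14300, 17000, 19000, 20100, 17900, 19700,
      25000, 29300]
Pr = [0.46, 0.63, 4.2, 3, 10, 18.6, 41, 58.5, 32, 70.3, 185, 590]

a_fit, m_fit, n_fit = fit_power_law(Nu, Re, Pr)
print(f"a = {a_fit:.4g}, m = {m_fit:.4g}, n = {n_fit:.4g}")
```

The same linearized form can be used for the method of averages by sorting the points, splitting them into three groups, and averaging ln Nu, ln Re and ln Pr within each group.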
Problem-2
• The data tabulated below are from an experiment to determine the growth rate of
bacteria k (per day) as a function of oxygen concentration C (mg/L).
The following model is proposed for the growth rate
Estimate Cs and kmax, and compare the parameters obtained using the method of
averages and the method of least squares. What would be the growth rate at C =
2 mg/L?