University of Southeastern Philippines
COLLEGE OF ENGINEERING
Obrero, Davao City
MATH 212
ENGINEERING DATA ANALYSIS
DALIA M. RECONALLA, Ph.D
August 2020
Faculty Information:
Name: Dalia M. Reconalla
Email: [email protected]
Contact Number: 0906-209-6611
Office: College of Engineering
Contact Number: (082) 224-3334
Consultation Hours: By appointment - may be arranged through:
Official email
Facebook messenger/Facebook group chat
Text or call
Getting help
For academic concerns (College/Adviser - Contact details)
For administrative concerns (College Dean - Contact details)
For UVE concerns (KMD - Contact details)
For health and wellness concerns (UAGC, HSD and OSAS - Contact details)
TABLE OF CONTENTS
CONTENTS
Cover Page
Faculty Information
Table of Contents
Lesson 3
Application 3
Module Summary
Module Assessment
References
F-Distribution Table
Learning Outcomes:
o Estimate the value of the response variable from given values of the independent variables.
o Conduct a test of hypothesis on the significance of the regression model.
Time Frame: Week 13
Introduction
In most research problems where regression analysis is applied, more than one independent variable is needed in the regression model. The complexity of most scientific mechanisms is such that, in order to predict an important response, a multiple regression model is needed. When this model is linear in the coefficients, it is called a multiple linear regression model.
Activity
Given the simple linear regression model y = 30.04 + 0.897x for the intelligence test score x and the freshman Math 121 grade y of a group of engineering students, what could be the grade of a randomly selected student with an intelligence test score of 75?
Analysis
Sketch the graph of the regression line and interpret the model.
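As a quick numerical check, here is a minimal Python sketch (the model y = 30.04 + 0.897x comes from the Activity; the plotting range of test scores is an assumption) that evaluates the fitted line at x = 75 and draws the graph asked for in the Analysis:

```python
import numpy as np
import matplotlib.pyplot as plt

# Fitted simple linear regression model from the Activity
a, b = 30.04, 0.897                  # intercept and slope

# Predicted Math 121 grade for an intelligence test score of 75
print("Predicted grade at x = 75:", a + b * 75)   # 30.04 + 0.897(75) = 97.315

# Sketch of the regression line over an assumed range of test scores
x = np.linspace(40, 100, 100)
plt.plot(x, a + b * x, label="y = 30.04 + 0.897x")
plt.scatter([75], [a + b * 75], color="red", zorder=3, label="x = 75")
plt.xlabel("Intelligence test score, x")
plt.ylabel("Math 121 grade, y")
plt.legend()
plt.show()
```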
Abstraction
From Lesson 2 (Simple Linear Regression) we learned that when there is one independent variable or predictor, the regression equation for predicting y from x is ŷ = a + bx.
The simultaneous use of two or more independent variables in predicting a dependent variable is called multiple regression.
When there are two independent variables, the sample regression equation for the i-th observation is
ŷᵢ = a + b₁x₁ᵢ + b₂x₂ᵢ
where:
ŷᵢ = the predicted value,
a = the y-intercept,
b₁ = the expected change in y when x₁ changes one unit and x₂ remains constant,
x₁ᵢ = the value of the first independent variable,
b₂ = the expected change in y when x₂ changes one unit and x₁ remains constant,
x₂ᵢ = the value of the second independent variable, and
i = 1, 2, ..., n, where n is the number of observations.
The equation for two independent variables can be extended to any number of independent variables, say k, such as x₁, x₂, ..., xₖ. The mean of y│x₁, x₂, ..., xₖ (read as "y given x₁, x₂, ..., xₖ") is given by the multiple regression model
μ_y│x₁,...,xₖ = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ     (k = number of independent variables)
and the estimated response is obtained from the sample regression equation
ŷ = b₀ + b₁x₁ + b₂x₂ + ... + bₖxₖ
where each regression coefficient βᵢ is estimated by bᵢ from the sample data using the method of least squares.
Estimating Coefficients
We shall obtain the least squares estimators of the parameters β₀, β₁, ..., βₖ by fitting the multiple linear regression model
μ_y│x₁,...,xₖ = β₀ + β₁x₁ + ... + βₖxₖ
to the data points
{(x₁ᵢ, x₂ᵢ, ..., xₖᵢ, yᵢ), i = 1, 2, ..., n and n > k},
where yᵢ is the observed response to the values x₁ᵢ, x₂ᵢ, ..., xₖᵢ of the k independent variables x₁, x₂, ..., xₖ. Each observation satisfies the equation
yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + ... + βₖxₖᵢ + εᵢ
or
yᵢ = b₀ + b₁x₁ᵢ + b₂x₂ᵢ + ... + bₖxₖᵢ + eᵢ,
where εᵢ and eᵢ are the random error and residual, respectively, associated with the response yᵢ.
In using the concept of least squares to arrive at the estimates b₀, b₁, ..., bₖ, we minimize the expression
SSE = Σeᵢ² = Σ(yᵢ − b₀ − b₁x₁ᵢ − ... − bₖxₖᵢ)².
Differentiating SSE in turn with respect to b₀, b₁, ..., bₖ and equating to zero, we generate the set of k + 1 normal equations:
nb₀ + b₁Σx₁ᵢ + b₂Σx₂ᵢ + ... + bₖΣxₖᵢ = Σyᵢ
b₀Σx₁ᵢ + b₁Σx₁ᵢ² + b₂Σx₁ᵢx₂ᵢ + ... + bₖΣx₁ᵢxₖᵢ = Σx₁ᵢyᵢ
⋮
b₀Σxₖᵢ + b₁Σxₖᵢx₁ᵢ + b₂Σxₖᵢx₂ᵢ + ... + bₖΣxₖᵢ² = Σxₖᵢyᵢ
These equations can be solved for b₀, b₁, b₂, ..., bₖ by any appropriate method for solving systems of linear equations, including the vector-matrix approach. Because solving them by hand is a tedious process, estimating the coefficients in practice usually requires the use of a computer program.
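As the paragraph above notes, in practice the normal equations are solved by computer. A minimal sketch of the vector-matrix approach in Python is shown below; the data arrays are hypothetical placeholders, not the Example 1 data.

```python
import numpy as np

# Hypothetical observations: k = 2 predictors, n = 5 data points
x1 = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
x2 = np.array([18.0, 20.0, 22.0, 21.0, 19.0])
y  = np.array([150.0, 170.0, 200.0, 210.0, 190.0])

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# np.linalg.lstsq minimizes SSE, which is equivalent to solving the normal equations
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", b)
```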
In what follows we consider only two independent variables as an example, for ease of manual algebraic computation. The sample regression equation is ŷ = a + b₁x₁ + b₂x₂, where the regression coefficients are determined from the system of normal equations:
Σy = na + b₁Σx₁ + b₂Σx₂
Σx₁y = aΣx₁ + b₁Σx₁² + b₂Σx₁x₂
Σx₂y = aΣx₂ + b₁Σx₁x₂ + b₂Σx₂²
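For the two-variable case, these three normal equations can also be assembled directly from the sums, as in this hedged sketch (the arrays are again placeholders):

```python
import numpy as np

# Placeholder data for two independent variables and the response
x1 = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
x2 = np.array([18.0, 20.0, 22.0, 21.0, 19.0])
y  = np.array([150.0, 170.0, 200.0, 210.0, 190.0])
n = len(y)

# Coefficient matrix and right-hand side of the three normal equations
A = np.array([
    [n,        x1.sum(),      x2.sum()],
    [x1.sum(), (x1**2).sum(), (x1*x2).sum()],
    [x2.sum(), (x1*x2).sum(), (x2**2).sum()],
])
rhs = np.array([y.sum(), (x1*y).sum(), (x2*y).sum()])

a, b1, b2 = np.linalg.solve(A, rhs)
print("a, b1, b2 =", a, b1, b2)
```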
Example 1: The average monthly electric power consumption (y) at a certain manufacturing plant is considered to be linearly dependent on the ambient temperature (x₁) and the number of working days in a month (x₂). Consider the one-year data given in the table.
a. Determine the least-squares estimates of the associated linear regression coefficients. Find the regression equation representing the average electric power consumption in terms of ambient temperature and number of working days in a month.
b. Estimate the average monthly electric power consumption if the plant's average ambient temperature is 48 and the number of working days in the month is 22.
Solution: Here k = 2 independent variables and n = 12 monthly observations. Compute the sums Σx₁, Σx₂, Σy, Σx₁², Σx₂², Σx₁x₂, Σx₁y, and Σx₂y from the data (presented in the table), then substitute these values into the system of normal equations.
Substituting the sums gives the numerical system of equations for the regression coefficients. Find the values of a, b₁, and b₂ using the algebraic method, determinants, or matrices. The estimated regression equation based on the data is
ŷ = a + b₁x₁ + b₂x₂, with fitted slopes b₁ = 0.39 and b₂ = 10.80.
Interpretation:
For every unit change in the ambient temperature, there is a corresponding 0.39 increase in the average monthly electric power consumption, holding the number of working days in a month constant. Likewise, for every additional working day in a month, there is a 10.80 increase in the average monthly power consumption, holding the ambient temperature constant.
b. Estimate the average monthly electric power consumption if the plant's average ambient temperature is 48 and the number of working days in the month is 22.
From the equation ŷ = a + b₁x₁ + b₂x₂, with x₁ = 48 and x₂ = 22,
ŷ = 222.48.
Properties of the Least Squares Estimator
For the multiple linear regression model
y = β₀ + β₁x₁ + ... + βₖxₖ + ε,
an unbiased estimate of the variance σ² is given by the error or residual mean square
s² = MSE = SSE / (n − (k + 1)),
where
SSE = Σeᵢ² = Σ(yᵢ − ŷᵢ)².
The sum of squares identity
Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²
continues to hold, that is,
SST = SSR + SSE,
with SST = Σ(yᵢ − ȳ)² = total sum of squares
and SSR = Σ(ŷᵢ − ȳ)² = regression sum of squares.
There are k degrees of freedom associated with SSR, SSE has n − (k + 1) degrees of freedom, and, as always, SST has n − 1 degrees of freedom.
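A short sketch of these quantities in Python; the observed and fitted values below are hypothetical stand-ins for the output of a k = 2 predictor fit:

```python
import numpy as np

# Hypothetical observed responses and fitted values (k = 2 predictors)
y     = np.array([150.0, 170.0, 200.0, 210.0, 190.0, 180.0])
y_hat = np.array([155.0, 168.0, 195.0, 205.0, 195.0, 182.0])
n, k = len(y), 2

SSE = np.sum((y - y_hat) ** 2)          # error (residual) sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
SST = np.sum((y - y.mean()) ** 2)       # total sum of squares

# The identity SST = SSR + SSE holds exactly only for true least-squares fitted
# values; with these made-up y_hat values the two sides need not match.
print("SST =", SST, " SSR + SSE =", SSR + SSE)

s2 = SSE / (n - (k + 1))                # residual mean square, the unbiased estimate of sigma^2
print("s^2 =", s2)
```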
Inference in Multiple Regression
In multiple regression analysis, the response variable is described as a function of more than one predictor variable; therefore, there are several types of inferences that can be made using this model. In the simple linear model studied earlier, the test for the slope (t-test) is equivalent to the test for the utility of the model (F-test). In multiple regression, however, they differ on account of there being more than one slope parameter.
A Test of Model Adequacy
To find a statistic that measures how well a multiple regression model fits a set of data, we use the multiple regression equivalent of r², the coefficient of determination for the straight-line model. Thus, we define the multiple coefficient of determination, R², as
R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)² = 1 − SSE/SST,
where ŷᵢ = the predicted value of yᵢ from the fitted model. R² is the fraction of the sample variation of the y values (measured by SSyy = SST) that is explained by the least-squares prediction equation.
Thus, R² = 0 implies a complete lack of fit of the model to the data, and R² = 1 implies a perfect fit, with the model passing through every data point. In general, 0 ≤ R² ≤ 1, and the larger the value of R², the better the model fits the data.
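In code, R² follows directly from the two sums of squares. As a hedged illustration, the SSE and SST values below are the ones reported later in Example 2 for the Example 1 data:

```python
# Multiple coefficient of determination from the sums of squares
SSE = 2004.7456     # error sum of squares (reported in Example 2)
SST = 6707.667      # total sum of squares (reported in Example 2)

R2 = 1 - SSE / SST  # equivalently SSR / SST
print("R^2 =", round(R2, 4))   # about 0.70
```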
The fact that R² is a sample statistic implies that it can be used to make inferences about the utility of the entire model for predicting the population of y values at each setting of the independent variables.
Testing the Utility of Multiple Regression Model: The Global F-Test
H₀: β₁ = β₂ = ... = βₖ = 0
Hₐ: At least one of the parameters β₁, β₂, ..., βₖ is nonzero.
Test statistic:
F = (R²/k) / [(1 − R²)/(n − (k + 1))] = MSR/MSE
Rejection region: F > F_α(k, n − [k + 1])
Using the p-value approach: reject H₀ if p-value < α, where the
p-value = P(F(k, n − [k + 1]) > F).
Conditions:
1. The error component ε is normally distributed.
2. The mean of ε is zero.
3. The errors associated with different observations are independent.
Analysis of Variance (ANOVA)
The analysis of variance table for the multiple regression problem provides a test of the null hypothesis
H₀: β₁ = β₂ = ... = βₖ = 0,
which implies that the response variable y is not related to any of the k input variables.
Analysis of Variance (ANOVA) Table
Source of Variation    df             Sum of Squares    Mean Square               F
Regression             k              SSR               MSR = SSR/k               MSR/MSE
Error (Residual)       n − (k + 1)    SSE               MSE = SSE/(n − (k + 1))
Total                  n − 1          SST
The tail values of the F-distribution are given in the F-Distribution Table (Appendix Table 1). The F-test statistic becomes large as the coefficient of determination R² becomes large. To determine how large F must be before we can conclude, at a given significance level, that the model is useful for predicting y, we set up the rejection region (RR) as F > F_α(k, n − [k + 1]).
Example 2. Using the data given in Example 1, decide, at the 5% significance level, whether the data provide sufficient evidence to conclude that the ambient temperature and the number of working days in a month (predictor variables) are useful for predicting the average monthly power consumption (response variable).
Solution:
Step 1. State the null and alternative hypotheses.
H₀: β₁ = β₂ = 0
Hₐ: At least one of the parameters is nonzero.
Step 2. Decide on the significance level, α.
Perform the hypothesis test at the 5% significance level, or α = 0.05.
Step 3. Compute the value of the test statistic, F.
k = 2, n = 12
SSE = Σ(yᵢ − ŷᵢ)² = 2004.7456
SST = Σ(yᵢ − ȳ)² = 6707.667
Finding R²:
R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)² = 1 − SSE/SST = 1 − 2004.7456/6707.667 = 0.7011
So
F = (R²/k) / [(1 − R²)/(n − (k + 1))] = (0.7011/2) / (0.2989/9) = 10.56.
We can also find F using the formula
F = MSR/MSE.
Solving for the mean squares, with SSR = SST − SSE = 6707.667 − 2004.7456 = 4702.9214:
MSR = SSR/k = 4702.9214/2 = 2351.4607
MSE = SSE/(n − (k + 1)) = 2004.7456/9 = 222.7495
F = MSR/MSE = 2351.4607/222.7495 = 10.56
Step 4. Decide whether to reject H₀.
Compare F with the critical value F_0.05(k, n − [k + 1]) = F_0.05(2, 9) = 4.2565 (refer to the F-Distribution Table, Appendix Table 1, in this module).
Since F = 10.56 > F_0.05(2, 9) = 4.2565, we reject the null hypothesis and conclude that, at the 5% level of significance, there is sufficient evidence that the ambient temperature and the number of working days in a month can be used to predict the average monthly power consumption. Likewise, it can be concluded that average monthly power consumption is linearly related to either ambient temperature or number of working days in a month, or both.
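The arithmetic in Example 2 can be checked with a short sketch using scipy's F distribution; the sums of squares are the ones reported above.

```python
from scipy import stats

SSE, SST = 2004.7456, 6707.667     # from Example 2
k, n = 2, 12
SSR = SST - SSE                    # 4702.9214

MSR = SSR / k                      # 2351.4607
MSE = SSE / (n - (k + 1))          # 222.7495
F = MSR / MSE                      # about 10.56

F_crit = stats.f.ppf(0.95, k, n - (k + 1))   # about 4.2565
p_value = stats.f.sf(F, k, n - (k + 1))      # well below alpha = 0.05

print("F =", round(F, 2), " F_crit =", round(F_crit, 4), " p-value =", round(p_value, 4))
```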
Multiple Correlation
The correlation between y and the combined predictors x₁, x₂, ..., xₖ is called the coefficient of multiple correlation and is denoted by R_y·12...k, or simply R.
The dot after y in the notation R_y·12...k separates the dependent variable, y, from the independent variables, x₁, x₂, ..., xₖ.
For the two-predictor case, R_y·12 is given by
R_y·12 = √[(r_y1² + r_y2² − 2·r_y1·r_y2·r_12) / (1 − r_12²)],
where r_y1, r_y2, and r_12 are the correlation coefficients for the respective pairs of variables.
The multiple correlation coefficient can assume values from 0 to 1, where 0 indicates the absence of a linear multiple correlation between y and the independent variables and 1 indicates a perfect linear multiple correlation in which all of the observed y's fall on the regression plane.
Coefficient of Multiple Determination. The proportion of variance in y
accounted for by the combined predictors x1, x2, . . . , xk is obtained by squaring
the multiple correlation coefficient and is called the coefficient of multiple
determination, R2.
This coefficient is an extension of the coefficient of determination for one predictor, r², discussed in Lesson 1.
For two predictors,
R²_y·12 = (r_y1² + r_y2² − 2·r_y1·r_y2·r_12) / (1 − r_12²),
and the multiple correlation coefficient is R_y·12 = √(R²_y·12).
A comparison of the value of R² with that for r² indicates the improvement in predicting y that can be achieved by using a multiple regression equation instead of a one-predictor regression equation.
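A minimal sketch of the two-predictor computation; the three pairwise correlations below are hypothetical values, not the Example 1 results:

```python
import math

# Hypothetical pairwise correlations among y, x1, and x2
r_y1, r_y2, r_12 = 0.60, 0.75, 0.30

R2 = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
R = math.sqrt(R2)
print("R^2 =", round(R2, 4), " R =", round(R, 4))
```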
Table 1. Intercorrelation among the Variables
Variable    y        x₁       x₂
y           1.000
x₁          r_y1     1.000
x₂          r_y2     r_12     1.000
The coefficient of multiple determination will be relatively large when the correlation of each of the predictors with y is large and the correlations among the predictors are 0 or very small. In fact, if the independent variables are mutually uncorrelated,
R²_y·12...k = r_y1² + r_y2² + ... + r_yk².
If correlations exist among some or all of the independent variables, it is usually the case that
R²_y·12...k < r_y1² + r_y2² + ... + r_yk².
The presence of nonzero correlations among the independent variables is referred to
as multicollinearity.
Extreme multicollinearity occurs when one independent variable is a linear function of other independent variables; for example, x₂ might equal 3x₁, or x₃ might equal x₁ + x₂. In the latter case, the inclusion of x₃ in the regression equation would not account for any variance in y not already accounted for by x₁ and x₂. Ideally, you would like to have predictors that have high correlations with the dependent variable and zero correlations with each other. Unfortunately, in the behavioral sciences, health sciences, and education, it is difficult to find predictors that meet these criteria. Once you have found three or four good predictors, it is often difficult to find additional predictors that are not highly correlated with at least one of the original predictors.
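As an illustration of extreme multicollinearity, the sketch below (hypothetical data) builds a predictor that is an exact linear function of two others and shows how the redundancy appears in the correlation matrix and in the rank of the predictor matrix:

```python
import numpy as np

# Hypothetical predictors; x3 is an exact linear function of x1 and x2
x1 = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
x2 = np.array([18.0, 20.0, 22.0, 21.0, 19.0])
x3 = x1 + x2                                   # extreme multicollinearity

X = np.column_stack([x1, x2, x3])
print(np.corrcoef(X, rowvar=False))            # large off-diagonal correlations involving x3
print("rank =", np.linalg.matrix_rank(X))      # 2, not 3: x3 adds no new information
```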
Application
1. Construct a table showing the intercorrelation among the variables average monthly electric power consumption (y), ambient temperature (x₁), and number of working days in a month (x₂) in Example 1, and determine the multiple coefficient of determination R²_y·12. Verify whether multicollinearity exists between the two independent variables.
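One possible starting point for the Application is sketched below; the arrays are placeholders that must be replaced with the twelve monthly values of y, x₁, and x₂ from the Example 1 table:

```python
import numpy as np

# Placeholder arrays: substitute the twelve monthly values from Example 1
y  = np.array([210.0, 225.0, 240.0, 250.0, 245.0, 230.0])
x1 = np.array([30.0, 35.0, 42.0, 50.0, 48.0, 38.0])
x2 = np.array([20.0, 21.0, 23.0, 24.0, 22.0, 21.0])

# Intercorrelation table, rows/columns ordered y, x1, x2 (compare with Table 1)
corr = np.corrcoef(np.vstack([y, x1, x2]))
print(np.round(corr, 3))

# Multiple coefficient of determination from the pairwise correlations
r_y1, r_y2, r_12 = corr[0, 1], corr[0, 2], corr[1, 2]
R2 = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
print("R^2 =", round(R2, 4))
```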
Closure
Congratulations! You have successfully completed the tasks and activities for Lesson 3. It is expected that your knowledge of correlation and regression will help you in solving other real-life problems and practical applications involving prediction or estimation.
You are almost done with this module. The module summary and assessment follow.
SUMMARY
o A measure of the degree of linear relationship between two variables is called the correlation coefficient, r.
The value of r is a measure of the extent to which x and y are linearly related.
The value of r does not depend on the unit of measurement for either variable.
The value of r does not depend on which of the two variables is considered x.
The value of r is between −1 and +1.
A correlation coefficient of r = +1 occurs only when all the points in a scatterplot of the data lie exactly on a straight line that slopes upward. Similarly, r = −1 occurs only when all the points lie exactly on a downward-sloping line.
o Regression analysis is a statistical technique used for determining the functional form of the relationship between two or more variables, where one variable is called the dependent or response variable and the rest are called the independent or explanatory variables.
o The coefficient of determination, denoted by r², gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.
MODULE ASSESSMENT
Solve the following problems.
1. Regression methods were used to analyze the data from a study investigating the relationship between roadway surface temperature (x) and pavement deflection (y). The data follow:
Given the data above:
(a) Estimate the intercept and slope regression coefficients. Write the estimated regression line.
(b) Find the standard errors of the slope and intercept coefficients.
(c) Compute the coefficient of determination, r². Comment on the value.
(d) Use a t-test to test for significance of the intercept and slope coefficients at α = 0.05.
(e) Draw the regression line.
2. The article “How to Optimize and Control the Wire Bonding Process” described an experiment carried out to assess the impact of the variables force, x₁, and temperature (degrees Celsius), x₂, on ball bond shear strength (gm), y. The following data were generated:
a. Find the regression equation representing the ball bond shear strength in terms of force and temperature.
b. Determine whether the data provide sufficient evidence to conclude that the force and temperature are useful for predicting the ball bond shear strength at the 5% level of significance.
References
Broto, A. S. (2007). Simplified Approach to Inferential Statistics (1st ed.). National, Philippines.
Carambas, Zenaida U. (2011). Basic Probability and Statistics. Valencia Educational Supply, Baguio City.
Peck, R., Olsen, C., & Devore, J. L. (2012). Introduction to Statistics and Data Analysis (4th ed.). Brooks/Cole, Cengage Learning, Boston, MA, USA.
Ott, R. L., & Longnecker, M. (2010). An Introduction to Statistical Methods and Data Analysis (6th ed.). Brooks/Cole, Cengage Learning, CA, USA.
Roussas, George (2003). Introduction to Probability and Statistical Inference. Elsevier Science, USA.
Walpole, R. E., & Myers, R. H. (1993). Probability and Statistics for Engineers and Scientists (5th ed.). Macmillan Publishing Company, New York.
Weiss, N. A. (2012). Elementary Statistics (8th ed.). Addison-Wesley, Pearson Education, Inc., Boston, MA.
Woodbury, George (2002). An Introduction to Statistics (1st ed.). Thomson Learning, Inc., USA.
Appendix Table 1. F Distribution