Analysis of Data
Objectives
The objectives of this lesson are to introduce:
• Concept of Multivariate Data Analysis
• Techniques of Multivariate Analysis
• Multiple Regression Analysis
• Discriminant Analysis
• Factor Analysis
• ANOVA
Structure:
4.1 Introduction
4.2 Multivariate Data Analysis
4.3 Multivariate Analysis Techniques
4.4 Multiple Regression Analysis
4.5 Discriminant Analysis
4.6 Factor Analysis
4.7 ANOVA
4.8 Summary
4.9 Self Assessment Questions
4.1 INTRODUCTION
Multivariate analysis is based on the observation and analysis of more than one statistical outcome
variable at a time. In design and analysis, the technique is used to perform trade studies across multiple
dimensions while taking into account the effects of all variables on the responses of interest. Multivariate
methods were developed to analyze large databases and increasingly complex data. Since modelling is
the best way to represent our knowledge of reality, multivariate statistical methods are the natural tool.
Multivariate methods are designed to analyze several variables simultaneously, i.e., to analyze the
different variables recorded for each person or object studied. Keep in mind at all times that all
variables must be treated so that they accurately reflect the reality of the problem addressed. There are
different types of multivariate analysis, and each one should be employed according to the type of
variables to be analyzed: dependence, interdependence and structural methods.
4.2 MULTIVARIATE DATA ANALYSIS
Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which
involves observation and analysis of more than one statistical outcome variable at a time. In design and
analysis, the technique is used to perform trade studies across multiple dimensions while taking into
account the effects of all variables on the responses of interest. Uses for multivariate analysis include:
i) Design for capability (also known as capability-based design).
ii) Inverse design, where any variable can be treated as an independent variable.
iii) Analysis of Alternatives (AoA), the selection of concepts to fulfill a customer need.
iv) Analysis of concepts with respect to changing scenarios.
v) Identification of critical design drivers and correlations across hierarchical levels.
Multivariate analysis can be complicated by the desire to include physics-based analysis to calculate
the effects of variables for a hierarchical "system-of-systems." Often, studies that wish to use multivariate
analysis are stalled by the dimensionality of the problem. These concerns are often eased through the
use of surrogate models, highly accurate approximations of the physics-based code. Since surrogate
models take the form of an equation, they can be evaluated very quickly. This becomes an enabler for
large-scale MVA studies: while a Monte Carlo simulation across the design space is difficult with physics-
based codes, it becomes trivial when evaluating surrogate models, which often take the form of response
surface equations.
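To make this concrete, the following is a minimal sketch of a Monte Carlo study over a response-surface surrogate; the quadratic coefficients are invented for illustration and would, in practice, be fitted to runs of the expensive physics-based code (Python with NumPy assumed):

```python
# Sketch: Monte Carlo over a response-surface surrogate (NumPy assumed).
# The coefficients below are invented; in practice they are fitted to a
# modest number of runs of the expensive physics-based code.
import numpy as np

rng = np.random.default_rng(0)

def surrogate(x1, x2):
    # A response-surface equation: cheap to evaluate because it is just algebra
    return 3.0 + 1.5 * x1 - 0.8 * x2 + 0.2 * x1 * x2 - 0.1 * x1 ** 2

# A million design points across the design space, evaluated almost instantly
x1 = rng.uniform(0.0, 10.0, 1_000_000)
x2 = rng.uniform(0.0, 5.0, 1_000_000)
response = surrogate(x1, x2)

print(response.mean(), response.std())  # distribution of the response of interest
```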
Multiple linear regression analysis makes several key assumptions. A linear relationship is assumed
between the dependent variable and the independent variables. The residuals are homoscedastic, i.e.,
their scatter plot is approximately rectangular-shaped, with no systematic pattern. Absence of
multicollinearity is assumed in the model, meaning that the independent variables are not too highly
correlated.
At the center of the multiple linear regression analysis is the task of fitting a single line through a
scatter plot. More specifically, the multiple linear regression fits a line through a multi-dimensional space
of data points. The simplest form has one dependent and two independent variables. The dependent
variable may also be referred to as the outcome variable or regressand. The independent variables may
also be referred to as the predictor variables or regressors.
There are three major uses for multiple linear regression analysis. First, it might be used to identify the
strength of the effect that the independent variables have on a dependent variable.
Second, it can be used to forecast effects or impacts of changes. That is, multiple linear regression
analysis helps us to understand how much the dependent variable will change when we change the
independent variables. For instance, a multiple linear regression can tell you how much GPA is expected
to increase (or decrease) for every one-point increase (or decrease) in IQ.
Third, multiple linear regression analysis predicts trends and future values. The multiple linear
regression analysis can be used to get point estimates.
The Multiple Regression Model
In general, the multiple regression equation of Y on X1, X2, …, Xk is given by:
Y = b0 + b1X1 + b2X2 + … + bkXk
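where b0 is the intercept and b1, …, bk are the regression coefficients. As a minimal sketch of how such a model can be fitted by ordinary least squares (Python with NumPy assumed; the IQ/GPA-style data below are invented, echoing the example above):

```python
# Sketch: fitting Y = b0 + b1*X1 + b2*X2 by ordinary least squares (NumPy assumed).
# Invented data: X1 = IQ, X2 = hours studied, Y = GPA.
import numpy as np

X = np.array([[110.0, 12.0],
              [120.0, 10.0],
              [100.0,  8.0],
              [130.0, 15.0],
              [115.0,  9.0]])
Y = np.array([3.0, 3.4, 2.6, 3.9, 3.1])

# A leading column of ones lets b0 (the intercept) be estimated with the rest
design = np.column_stack([np.ones(len(Y)), X])
coeffs, *_ = np.linalg.lstsq(design, Y, rcond=None)

print(coeffs)           # b0, b1, b2
print(design @ coeffs)  # point estimates (fitted values of Y)
```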
2. Factor Analysis
Factor analysis attempts to identify underlying variables, or factors, that explain the pattern of
correlations within a set of observed variables. Factor analysis is often used in data reduction to identify
a small number of factors that explain most of the variance that is observed in a much larger number of
manifest variables.
3. Cluster Analysis
A body of techniques with the purpose of classifying individuals or objects into a small number of
mutually exclusive groups, ensuring that there will be as much likeness within groups and as much
difference among groups as possible.
Concept of Cluster Analysis
Cluster analysis is a collection of statistical methods, which identifies groups of samples that behave
similarly or show similar characteristics. In common parlance it is also called look-a-like groups. The
simplest mechanism is to partition the samples using measurements that capture similarity or distance
between samples. In this way, clusters and groups are interchangeable words. Often in market research
studies, cluster analysis is also referred to as a segmentation method. In neural network terminology,
clustering is called unsupervised learning. Typically in clustering methods, all the samples within
a cluster are considered to belong equally to the cluster. If each observation has its own probability
of belonging to a group, and the application is more interested in these probabilities themselves, then we
have to use multinomial models.
Cluster analysis is a class of statistical techniques that can be applied to data that exhibit “natural”
groupings. Cluster analysis sorts through the raw data and groups them into clusters. A cluster is a group
of relatively homogeneous cases or observations. Objects in a cluster are similar to each other. They are
also dissimilar to objects outside the cluster, particularly objects in other clusters.
Explanation
Clustering and segmentation basically partition the database so that each partition or group is
similar according to some criteria or metric. Clustering according to similarity is a concept which appears
in many disciplines. If a measure of similarity is available there are a number of techniques for forming
clusters. Membership of groups can be based on the level of similarity between members and from this
the rules of membership can be defined. Another approach is to build set functions that measure some
property of partitions, i.e., groups or subsets, as functions of some parameter of the partition. This latter
approach achieves what is known as optimal partitioning.
Many data mining applications make use of clustering according to similarity for example to segment
a client/customer base. Clustering according to optimization of set functions is used in data analysis e.g.
when setting insurance tariffs the customers can be segmented according to a number of parameters
and the optimal tariff segmentation achieved.
Clustering/segmentation in databases is the process of separating a data set into components
that reflect a consistent pattern of behaviour. Once the patterns have been established they can then be
used to “deconstruct” data into more understandable subsets and also they provide sub-groups of a
population for further analysis or action which is important when dealing with very large databases. For
example, a database could be used for profile generation for target marketing where previous response
to mailing campaigns can be used to generate a profile of people who responded and this can be used to
predict response and filter mailing lists to achieve the best response.
Simple Cluster Analysis
In cases of one or two measures, a visual inspection of the data using a frequency polygon or
scatter plot often provides a clear picture of grouping possibilities. For example, the following is the
data from the “Example Assignment” of the cluster analysis homework assignment.
It is fairly clear from this picture that two subgroups, the first including X, Y, and Z and the second
including everyone else, describe the data fairly well. When faced with complex multivariate
data, such visualization procedures are not available and computer programs assist in assigning objects
to groups. The following text describes the logic involved in cluster analysis algorithms.
Steps in Doing a Cluster Analysis
A common approach to doing a cluster analysis is to first create a table of relative similarities or
differences between all objects and second to use this information to combine the objects into groups.
The table of relative similarities is called a proximities matrix. The method of combining objects into
groups is called a clustering algorithm. The idea is to combine objects that are similar to one another into
separate groups.
The Proximities Matrix
Cluster analysis starts with a data matrix, where objects are rows and observations are columns.
From this beginning, a table is constructed where objects are both rows and columns and the numbers in
the table are measures of similarity or difference between two observations. For example, given
the following data matrix:
X1  X2  X3  X4  X5
O1
O2
O3
O4
A proximities matrix would appear as follows:
O1  O2  O3  O4
O1
O2
O3
O4
The difference between a proximities matrix in cluster analysis and a correlation matrix is that a
correlation matrix contains similarities between variables (X1, X2) while the proximities matrix contains
similarities between observations (O1, O2).
The researcher has dual problems at this point. The first is a decision about what variables to
collect and include in the analysis. Selection of irrelevant measures will not aid in classification. For
example, including the number of legs an animal has would not help in differentiating cats and dogs,
although it would be very valuable in differentiating between spiders and insects.
The second problem is how to combine multiple measures into a single number, the similarity
between the two observations. This is the point where univariate and multivariate cluster analysis separate.
Univariate cluster analysis groups are based on a single measure, while multivariate cluster analysis is
based on multiple measures.
Univariate Measures
A simpler version of the problem of how to combine multiple measures into a measure of difference
between objects is how to turn a single measure into a measure of difference between objects.
Consider the following scores on a test for four students:
Student Score
X 11
Y 11
Z 13
A 18
The proximities matrix for these four students would appear as follows:
X Y Z A
X
Y
Z
A
The entries of this matrix will be described using a capital “D”, for distance, with a subscript
describing which row and column. For example, D34 would describe the entry in row 3, column 4, or in
this case, the intersection of Z and A.
One means of filling in the proximities matrix is to compute the absolute value of the difference
between scores. For example, the distance, D, between Z and A would be |13-18| or 5. Completing the
proximities matrix using the example data would result in the following:
X Y Z A
X 0 0 2 7
Y 0 0 2 7
Z 2 2 0 5
A 7 7 5 0
A second means of completing the proximities matrix is to use the squared difference between the
two measures. Using the example above, D34, the distance between Z and A, would be (13 – 18)² or 25.
This distance measure has the advantage of being consistent with many other statistical measures, such
as variance and the least squares criterion and will be used in the examples that follow. The example
proximities matrix using squared differences as the distance measure is presented below.
X Y Z A
X 0 0 4 49
Y 0 0 4 49
Z 4 4 0 25
A 49 49 25 0
Note that both example proximities matrices are symmetrical. Symmetrical means that row and
column entries can be interchanged or that the numbers are the same on each half of the matrix defined
by a diagonal running from top left to bottom right.
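As a quick check, the following sketch reproduces both example proximities matrices (Python with NumPy assumed):

```python
# Sketch: both proximities matrices for the four students (NumPy assumed).
import numpy as np

scores = np.array([11, 11, 13, 18])                      # X, Y, Z, A

abs_matrix = np.abs(scores[:, None] - scores[None, :])   # |difference|
sq_matrix = (scores[:, None] - scores[None, :]) ** 2     # squared difference

print(abs_matrix)   # entry D34 (row Z, column A) is |13 - 18| = 5
print(sq_matrix)    # entry D34 (row Z, column A) is (13 - 18)**2 = 25
```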
Other distance measures have been proposed and are available with statistical packages. For
example, SPSS/WIN provides options such as Euclidean distance, squared Euclidean distance, cosine,
Pearson correlation, Chebychev, block (city-block), Minkowski and customized distance measures.
Some of these options themselves contain options; for example, Minkowski and Customized are
really many different possible measures of distance.
Multivariate Measures
When more than one measure is obtained for each observation, some method of combining
the proximities matrices for the different measures must be found. Usually the matrices are summed
into a combined matrix. For example, given the following scores:
X1 X2
O1 25 11
O2 33 11
O3 34 13
O4 35 18
The two proximities matrices resulting from squared Euclidean distance could be summed
to produce a combined distance matrix.
O1   O2   O3   O4
O1 0 64 81 100
O2 64 0 1 4
O3 81 1 0 1
O4 100 4 1 0
+
O1 O2 O3 O4
O1 0 0 4 49
O2 0 0 4 49
O3 4 4 0 25
O4 49 49 25 0
=
O1 O2 O3 O4
O1 0 64 85 149
O2 64 0 5 53
O3 85 5 0 26
O4 149 53 26 0
Note that each corresponding cell is added. With more measures there are more matrices to be
added together.
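The following sketch reproduces the combined matrix above by summing the per-measure squared-difference matrices (Python with NumPy assumed):

```python
# Sketch: summing per-measure squared-difference matrices (NumPy assumed).
import numpy as np

def sq_distance_matrix(values):
    v = np.asarray(values, dtype=float)
    return (v[:, None] - v[None, :]) ** 2

x1 = [25, 33, 34, 35]   # first measure for O1..O4
x2 = [11, 11, 13, 18]   # second measure for O1..O4

combined = sq_distance_matrix(x1) + sq_distance_matrix(x2)
print(combined)   # matches the combined matrix above, e.g. D12 = 64, D34 = 26
```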
This system works reasonably well if the measures share similar scales. One measure can overwhelm
the other if the measures use different scales. Consider the following scores.
X1 X2
O1 25 11
O2 33 21
O3 34 33
O4 35 48
The two proximities matrices resulting from squared Euclidean distance could be summed
to produce a combined distance matrix.
O1 O2 O3 O4
O1 0 64 81 100
O2 64 0 1 4
O3 81 1 0 1
O4 100 4 1 0
+
O1   O2   O3   O4
O1 0 100 484 1369
O2 100 0 144 729
O3 484 144 0 225
O4 1369 729 225 0
=
O1 O2 O3 O4
O1 0 164 565 1469
O2 164 0 145 733
O3 565 145 0 226
O4 1469 733 226 0
It can be seen that the second measure overwhelms the first in the combined matrix.
For this reason, the measures are optionally transformed before they are combined. For example,
the previous data matrix might be converted to standard scores before computing the separate distance
matrices.
X1 X2 Z1 Z2
O1 25 11 -1.48 -1.08
O2 33 21 .27 -.45
O3 34 33 .49 .30
O4 35 48 .71 1.24
The two proximities matrices resulting from squared Euclidean distance applied to the standard
scores could be summed to produce a combined distance matrix.
O1 O2 O3 O4
O1 0 3.06 3.88 4.80
O2 3.06 0 .05 .19
O3 3.88 .05 0 .05
O4 4.80 .19 .05 0
+
O1 O2 O3 O4
O1 0 .40 1.90 5.38
O2 .40 0 .56 2.86
O3 1.9 .56 0 .88
O4 5.38 2.86 .88 0
=
O1   O2   O3   O4
O1 0 3.46 5.78 10.18
O2 3.46 0 .61 3.05
O3 5.78 .61 0 .93
O4 10.18 3.05 .93 0
The point is that the choice of whether to transform the data and the choice of distance metric can
result in vastly different proximities matrices.
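The following sketch standardizes each measure before computing distances and reproduces the combined matrix for the standardized example above (Python with NumPy assumed; sample standard deviations, ddof=1, give the z-scores shown in the table):

```python
# Sketch: standardize each measure, then combine squared distances (NumPy assumed).
import numpy as np

data = np.array([[25.0, 11.0],
                 [33.0, 21.0],
                 [34.0, 33.0],
                 [35.0, 48.0]])

# Sample standard deviations (ddof=1) reproduce the z-scores in the table above
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
print(np.round(z, 2))         # first row is about [-1.48, -1.08]

combined = sum((z[:, j][:, None] - z[:, j][None, :]) ** 2
               for j in range(z.shape[1]))
print(np.round(combined, 2))  # matches the combined matrix above (D12 = 3.46)
```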
4. Multidimensional Scaling
A statistical technique that measures objects in multidimensional space on the basis of respondents’
judgments of the similarity of objects.
5. Multivariate Analysis of Variance (MANOVA)
A statistical technique that provides a simultaneous significance test of mean difference between
groups for two or more dependent variables.
vii) Group sizes of the dependent should not be grossly different and should be at least five times the
number of independent variables.
4.7 ANOVA
Analysis of Variance (ANOVA) is a collection of statistical models and their associated procedures,
in which the observed variance in a particular variable is partitioned into components attributable to
different sources of variation. In its simplest form ANOVA provides a statistical test of whether or not
the means of several groups are all equal, and therefore generalizes the t-test to more than two groups.
Doing multiple two-sample t-tests would result in an increased chance of committing a Type I error. For
this reason, ANOVAs are useful in comparing two, three or more means.
An important technique for analyzing the effect of categorical factors on a response is to perform
an Analysis of Variance. An ANOVA decomposes the variability in the response variable amongst the
different factors. Depending upon the type of analysis, it may be important to determine: (a) which
factors have a significant effect on the response, and/or (b) how much of the variability in the response
variable is attributable to each factor.
Statgraphics Centurion provides several procedures for performing an analysis of variance:
1. One-Way ANOVA - used when there is only a single categorical factor. This is equivalent to
comparing multiple groups of data.
2. Multifactor ANOVA - used when there is more than one categorical factor, arranged in a
crossed pattern. When factors are crossed, the levels of one factor appear at more than one
level of the other factors.
3. Variance Components Analysis - used when there are multiple factors, arranged in a hierarchical
manner. In such a design, each factor is nested in the factor above it.
4. General Linear Models - used whenever there are both crossed and nested factors, when
some factors are fixed and some are random, and when both categorical and quantitative
factors are present.
One-Way ANOVA
A one-way analysis of variance is used when the data are divided into groups according to only one
factor. The questions of interest are usually: (a) Is there a significant difference between the groups and
(b) If so, which groups are significantly different from which others? Statistical tests are provided to
compare group means, group medians, and group standard deviations. When comparing means, multiple
range tests are used, the most popular of which is Tukey's HSD procedure. For equal size samples,
significant group differences can be determined by examining the means plot and identifying those
intervals that do not overlap.
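As a hedged sketch of this workflow (Python with SciPy and statsmodels assumed; the three groups of scores are invented for illustration), a one-way ANOVA can be followed by Tukey's HSD multiple range test:

```python
# Hypothetical example (data invented): one-way ANOVA, then Tukey's HSD
# to identify which groups differ. SciPy and statsmodels assumed installed.
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

a = [12, 14, 11, 13, 12]
b = [15, 17, 16, 18, 16]
c = [12, 13, 12, 14, 13]

print(stats.f_oneway(a, b, c))            # overall test that all means are equal

scores = a + b + c
labels = ['a'] * 5 + ['b'] * 5 + ['c'] * 5
print(pairwise_tukeyhsd(scores, labels))  # multiple range test: pairwise differences
```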
Multifactor ANOVA
When more than one factor is present and the factors are crossed, a multifactor ANOVA is
appropriate. Both main effects and interactions between the factors may be estimated. The output
includes an ANOVA table and a new graphical ANOVA from the latest edition of Statistics for
Experimenters by Box, Hunter and Hunter (Wiley, 2005). In a graphical ANOVA, the points are scaled
so that any levels that differ by more than the scatter exhibited in the distribution of the residuals are
significantly different.
Variance Components Analysis
A Variance Components Analysis is most commonly used to determine the level at which variability
is being introduced into a product. A typical experiment might select several batches, several samples
from each batch, and then run replicate tests on each sample. The goal is to determine the relative
percentages of the overall process variability that is being introduced at each level.
Assumptions of ANOVA
The analysis of variance has been studied from several approaches, the most common of which
use a linear model that relates the response to the treatments and blocks. Even when the statistical
model is nonlinear, it can be approximated by a linear model for which an analysis of variance may be
appropriate. For a one-way layout the model is Yij = μj + εij, where Yij is the i-th observation in group j,
and the assumptions are:
(1) The model is correctly specified.
(2) The εij's are normally distributed.
(3) The εij's have mean zero and a common variance, σ².
With multiple populations, detection of violations of these assumptions requires examining the residuals
rather than the Y-values themselves.
Illustration - 1
The following are measurements of performance obtained after training 4 groups by different
methods:
Method 1: 17 19 18 15 21 19 16 14
Method 2: 21 23 20 19 19
Method 3: 20 16 21 17 19 16 16
Method 4: 13 15 16 17 13 16
Find out whether there is a significant overall difference between these 4 groups in terms of their
performance after training (α = 0.05).
Solution:
Let the null hypothesis be that different methods of training do not result in any difference in
performance after training.
1 2 3 4
17 21 20 13
19 23 16 15
18 20 21 16
15 19 17 17
21 19 19 13
19 16 16
16 16
14
Coding the data (i.e., adding, subtracting, multiplying or dividing all observations by a number)
can simplify the task. Let us subtract 15 from all observations; we get:
1 2 3 4
2 6 5 -2
4 8 1 0
3 5 6 1
0 4 2 2
6 4 4 -2
4 1 1
1 1
-1
T1 = 19, n1 = 8
T2 = 27, n2 = 5
T3 = 20, n3 = 7
T4 = 0, n4 = 6
T = 66, N = 26
Correction factor = T²/N, where T = total of all observations and N = number of all observations
= 66²/26 = 167.54
Sum of squares between samples:
SSB = Σ(Tj²/nj) – T²/N
= (19²/8 + 27²/5 + 20²/7 + 0²/6) – 167.54
= 248.07 – 167.54 = 80.53
Sum of squares within samples:
SSW = ΣΣX²ij – Σ(Tj²/nj)
= (2² + 4² + 3² + 0² + 6² + 4² + 1² + (–1)² + 6² + 8² + 5² + 4² + 4² + 5² + 1² + 6² + 2² + 4² + 1² + 1²
+ (–2)² + 0² + 1² + 2² + (–2)² + 1²) – (19²/8 + 27²/5 + 20²/7 + 0²/6)
= 338 – 248.07 = 89.93
ANOVA Table
Source             SS        df                         MS                 F-ratio
Between samples    80.53     (k – 1) = (4 – 1) = 3      80.53/3 = 26.84    26.84/4.09 = 6.56
Within samples     89.93     (n – k) = (26 – 4) = 22    89.93/22 = 4.09
Total              170.46    (n – 1) = (26 – 1) = 25
F-ratio calculated = 6.56
F-ratio from table for v1 = 3 and v2 = 22 at 5% level of significance is 3.05
Since Fcalculated > Ftable, we reject the null hypothesis, which means there is a significant
overall difference between the 4 groups in terms of performance after training.
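The same decomposition can be verified programmatically; the sketch below recomputes SSB, SSW and the F-ratio from scratch and cross-checks the result with SciPy (Python with NumPy and SciPy assumed):

```python
# Sketch: recomputing Illustration 1 from scratch and with SciPy.
import numpy as np
from scipy import stats

groups = [
    [17, 19, 18, 15, 21, 19, 16, 14],   # Method 1
    [21, 23, 20, 19, 19],               # Method 2
    [20, 16, 21, 17, 19, 16, 16],       # Method 3
    [13, 15, 16, 17, 13, 16],           # Method 4
]
all_obs = np.concatenate([np.asarray(g, float) for g in groups])
N, k = all_obs.size, len(groups)
grand_mean = all_obs.mean()

ssb = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)    # ~80.53
ssw = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)  # ~89.93
f_ratio = (ssb / (k - 1)) / (ssw / (N - k))                           # ~6.56

f_scipy, p_value = stats.f_oneway(*groups)   # same F-ratio; p < 0.05
print(f_ratio, f_scipy, p_value)             # so the null hypothesis is rejected
```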
Illustration - 2
Three methods are used in a production process test. At the 5% level of significance, test
whether the three methods can be considered equivalent as far as output is concerned.
Method I 70 72 75 80 83
Method II 100 110 108 112 120 107
Method III 60 65 57 84 87 73
Solution:
Let the null hypothesis be that there is no significant difference between the three methods.
Method I II III
70 100 60
72 110 65
75 108 57
80 112 84
83 120 87
107 73
Correction factor = T²/N
where T = sum of all observations and N = number of observations.
Here, T1 = 380, T2 = 657, T3 = 426, T = 1,463
n1 = 5, n2 = 6, n3 = 6, N = 17
Correction factor = 1,463²/17 = 1,25,904.06
Sum of squares between samples:
SSB = Σ(Tj²/nj) – T²/N
= (380²/5 + 657²/6 + 426²/6) – 1,25,904.06
= 28,880 + 71,941.5 + 30,246 – 1,25,904.06 = 5,163.44
Sum of squares within samples:
SSW = ΣΣX²ij – Σ(Tj²/nj)
= (70² + 72² + 75² + 80² + 83² + 100² + 110² + 108² + 112² + 120² + 107² +
60² + 65² + 57² + 84² + 87² + 73²) – (380²/5 + 657²/6 + 426²/6)
= 1,32,183 – 1,31,067.5 = 1,115.5
ANOVA Table
Source             SS          df                        MS                       F-ratio
Between samples    5,163.44    (k – 1) = (3 – 1) = 2     5,163.44/2 = 2,581.72    2,581.72/79.68 = 32.4
Within samples     1,115.5     (n – k) = (17 – 3) = 14   1,115.5/14 = 79.68
Total              6,278.94    (n – 1) = (17 – 1) = 16
F-ratio calculated = 32.4
F-ratio from table for v1 = 2 and v2 = 14 at the 5% level of significance = 3.74
Since Fcalculated > Ftable, we reject the null hypothesis, which means there is a significant
difference between the three methods.
Illustration - 3
The following table gives the monthly sales (in thousands of rupees) of a certain firm in three
different states by 4 different salesmen.
Salesmen
States 1 2 3 4
A 10 8 8 14
B 14 16 10 8
C 18 12 12 14
Test whether:
i. Sales between salesmen are significant
ii. Sales between states are significant.
Solution:
Two Way ANOVA:
Let the first null hypothesis be that sales between salesmen are insignificant and the second null
hypothesis be that sales between states are insignificant.
i.e., H0(1): Sales between salesmen are insignificant
H0(2): Sales between states are insignificant
By coding the data, we can simplify the task. Let us subtract 12 from all the observations and
we get:
              Salesmen                  Total
State A    –2    –4    –4     2          –8
State B     2     4    –2    –4           0
State C     6     0     0     2           8
Total       6     0    –6     0           0
Correction factor = T²/N = 0²/12 = 0
where T = total of all samples and N = number of samples.
Total sum of squares:
SST = ΣX²ij – T²/N
= ((–2)² + 2² + 6² + (–4)² + 4² + 0² + (–4)² + (–2)² + 0² + 2² + (–4)² + 2²) – 0 = 120
Sum of squares between columns (i.e., between salesmen):
SSC = Σ(Tj²/nj) – T²/N = (6²/3 + 0²/3 + (–6)²/3 + 0²/3) – 0 = 24
Sum of squares between rows (i.e., between states):
SSR = Σ(Ti²/ni) – T²/N = ((–8)²/4 + 0²/4 + 8²/4) – 0 = 32
Sum of squares of residual or error:
SSres = SST – (SSC + SSR) = 120 – (24 + 32) = 64
ANOVA Table
Source               SS     df                             MS              F-ratio
Between salesmen     24     (c – 1) = (4 – 1) = 3          24/3 = 8        10.67/8 = 1.33
Between states       32     (r – 1) = (3 – 1) = 2          32/2 = 16       16/10.67 = 1.50
Residual or error    64     (c – 1)(r – 1) = (3)(2) = 6    64/6 = 10.67
Total                120    (n – 1) = (12 – 1) = 11
Note:
F-ratio = Greater variance / Smaller variance
i) Calculated F(6, 3) = 1.33 < Table F(6, 3) = 8.94
Hence, conclude that the null hypothesis holds good and there is no significant difference between
the salesmen.
ii) Calculated F(2, 6) = 1.5 < Table F(2, 6) = 5.14
Hence, the null hypothesis is accepted and we conclude that there is no significant difference between
the states.
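The following sketch reproduces this two-way (no replication) decomposition from scratch (Python with NumPy assumed):

```python
# Sketch: two-way ANOVA without replication for Illustration 3 (NumPy assumed).
import numpy as np

sales = np.array([[10,  8,  8, 14],   # state A, salesmen 1-4
                  [14, 16, 10,  8],   # state B
                  [18, 12, 12, 14]])  # state C

grand = sales.mean()
r, c = sales.shape                    # r = 3 states, c = 4 salesmen

ss_total = ((sales - grand) ** 2).sum()                   # 120
ss_cols = r * ((sales.mean(axis=0) - grand) ** 2).sum()   # 24 (between salesmen)
ss_rows = c * ((sales.mean(axis=1) - grand) ** 2).sum()   # 32 (between states)
ss_err = ss_total - ss_cols - ss_rows                     # 64

ms_err = ss_err / ((r - 1) * (c - 1))                     # 10.67
f_states = (ss_rows / (r - 1)) / ms_err                   # 1.50, as in the table
# For the salesmen, MS (8) is smaller than the error MS (10.67); the text
# therefore reports the inverted ratio 10.67/8 = 1.33 (greater over smaller).
f_salesmen = ms_err / (ss_cols / (c - 1))                 # 1.33
print(ss_total, ss_cols, ss_rows, ss_err, f_states, f_salesmen)
```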
Illustration - 4
The following table shows the lifetimes in hours of samples from three different types of
television tubes manufactured by a company, coded by subtracting a convenient origin. Determine
whether there is a difference between the three types at the 0.01 significance level.
                                     Ti      Ti²     Ti²/n
S1     1     5     3                  9       81      27
S2    –2     0     2    –1    –4     –5       25       5
S3     4     2     0     2            8       64      16
                                 T = 12               48
CF = T²/N = 12²/12 = 12
SS = ΣΣx²ij – CF = 1² + 5² + 3² + (–2)² + 0² + 2² + (–1)² + (–4)² + 4² + 2² + 0² + 2² – 12
= 1 + 25 + 9 + 4 + 0 + 4 + 1 + 16 + 16 + 4 + 0 + 4 – 12 = 84 – 12 = 72
SSR = Σ(Ti²/n) – CF = 48 – 12 = 36
SSE = SS – SSR = 72 – 36 = 36
ANOVA Table
SV          SS    df    MS    F-ratio
B/w rows    36     2    18    F = 18/4 = 4.5
Error       36     9     4
F(2, 9) table value at the 0.01 level = 8.02
∴ F < Fα
Accept H0
i.e., there is no significant difference between the 3 samples.
Illustration - 5
A research company has designed three different systems to clear up oil spills. The following
table contains the results, measured by how much surface area (in square meters) is cleared in 1
hour. The data were found by testing each method in several trials. Are the three systems equally
effective? Use the 0.05 level of significance.
System A: 55 60 63 56 59 55
System B: 57 53 64 49 62
System C: 66 52 61 57
Solution:
Let us change the origin: x = X – 55.
                                               Ti      Ti²     Ti²/n
System A:     0     5     8     1     4    0    18      324      54
System B:     2    –2     9    –6     7         10      100      20
System C:    11    –3     6     2               16      256      64
                                           T = 44               138
CF = T²/N = 44²/15 = 1936/15 = 129.07
SS = ΣΣx²ij – CF
= 25 + 64 + 1 + 16 + 4 + 4 + 81 + 36 + 49 + 121 + 9 + 36 + 4 – 129.07
= 450 – 129.07 = 320.93
SSR = Σ(Ti²/n) – CF = 138 – 129.07 = 8.93
SSE = SS – SSR = 320.93 – 8.93 = 312
ANOVA Table
SV             SS        df    MS       F-ratio
B/w systems    8.93       2    4.465    F = 26/4.465 = 5.823 (greater variance over smaller)
Error          312       12    26
Total          320.93    14
Table F(12, 2) at the 5% level = 19.41
F < Fα
Accept H0
i.e., there is no significant difference between the systems.
Illustration - 6
The following table shows the yields per acre of four different plant crops grown on lots
treated with three different types of fertilizer. Determine at the 0.05 significance level whether
there is a difference in yield per acre
i) due to the fertilizers and
ii) due to the crops
               Crop 1    Crop 2    Crop 3    Crop 4
Fertilizer A      4.5       6.4       7.2       6.7
Fertilizer B      8.8       7.8       9.6       7.0
Fertilizer C      5.9       6.8       5.7       5.2
Solution:
Column (crop) totals: Tj = 19.2, 21.0, 22.5, 18.9, so Σ(Tj²/n) = 122.88 + 147 + 168.75 + 119.07 = 557.7
Row (fertilizer) totals: Ti = 24.8, 33.2, 23.6, so Σ(Ti²/n) = 153.76 + 275.56 + 139.24 = 568.56
C.F. = T²/N = (81.6)²/12 = 6658.56/12 = 554.88
SS = ΣΣx²ij – CF = 577.96 – 554.88 = 23.08
SSR = Σ(Ti²/n) – CF = 568.56 – 554.88 = 13.68
SSC = Σ(Tj²/n) – CF = 557.7 – 554.88 = 2.82
SSE = SS – SSR – SSC = 23.08 – 13.68 – 2.82 = 6.58
ANOVA Table
SV             SS       df    MS      F-ratio
B/w rows       13.68     2    6.84    F1 = 6.84/1.1 = 6.218
B/w columns     2.82     3    0.94    F2 = 1.1/0.94 = 1.17
Error           6.58     6    1.1
Total          23.08    11
Table values at the 5% level: Fα1 = F(2, 6) = 5.14; Fα2 = F(6, 3) = 8.94
Since F1 = 6.218 > 5.14, there is a significant difference in yield due to the fertilizers; since
F2 = 1.17 < 8.94, there is no significant difference in yield due to the crops.
Illustration - 7
The following data show the output of four mechanics on three machines:
Mechanic     Machine A    Machine B    Machine C
1 44 48 38
2 37 40 36
3 45 38 32
4 40 44 44
Test whether:
(i) Mean productivity is same for machines.
(ii) Mean productivity is same for mechanics.
Solution:
A 2-way ANOVA technique will enable us to solve and answer the question asked.
Let us take null hypothesis that
i) There is no significant difference between the machines productivity.
ii) There is no significant difference between the mechanics productivity.
Let us code the data by subtracting 40 from all observations to simplify the task.
Machines Total
4 8 2 10
Mechanics 3 0 4 7
5 2 8 5
0 4 4 8
Total 6 10 10 6
Correction factor = T²/N = 6²/12 = 3
where T = total of all observations and N = number of observations.
SST = ΣX²ij – T²/N
= (4² + 8² + (–2)² + (–3)² + 0² + (–4)² + 5² + (–2)² + (–8)² + 0² + 4² + 4²) – 3 = 231
Sum of squares between columns (i.e., between machines):
SSC = Σ(Tj²/nj) – T²/N = (6²/4 + 10²/4 + (–10)²/4) – 3 = 59 – 3 = 56
Sum of squares between rows (i.e., between mechanics):
SSR = Σ(Ti²/ni) – T²/N = (10²/3 + (–7)²/3 + (–5)²/3 + 8²/3) – 3 = 79.33 – 3 = 76.33
SSE = SST – (SSC + SSR) = 231 – (56 + 76.33) = 98.67
ANOVA Table
Source               SS       df                    MS                 F-ratio
Between machines     56       (c – 1) = 2           56/2 = 28          28/16.45 = 1.7
Between mechanics    76.33    (r – 1) = 3           76.33/3 = 25.44    25.44/16.45 = 1.55
Residual or error    98.67    (c – 1)(r – 1) = 6    98.67/6 = 16.45
Total                231      (n – 1) = 11
Table values of F ratio at 5% level of significance:
F(2, 6) = 5.14
F(3, 6) = 4.76
(i) Calculated F(2, 6) = 1.7 < Table F(2, 6) = 5.14.
Hence, the null hypothesis is accepted, i.e., there is no significant difference between machines,
which means the mean productivity is the same for the machines.
(ii) Calculated F(3, 6) = 1.55 < Table F(3, 6) = 4.76.
Hence, the null hypothesis is accepted, i.e., there is no significant difference between mechanics,
which means the mean productivity is the same for the mechanics.
Illustration - 8
Set up an ANOVA table for the following information relating to three drugs tested to judge their
effectiveness in reducing blood pressure for three different groups of people.
                       Drug 1    Drug 2    Drug 3    Total
Group of people 1      14, 15    10, 9     11, 11      70
Group of people 2      12, 11     7, 8     10, 11      59
Group of people 3      10, 11    11, 11     8, 7       58
Total                      73        56        58     187
Correction factor = T²/N = 187²/18 = 1942.72
Total sum of squares:
SST = ΣX²ij – T²/N
= (14² + 15² + 12² + 11² + 10² + 11² + 10² + 9² + 7² + 8² + 11² + 11² + 11²
+ 11² + 10² + 11² + 8² + 7²) – 1942.72
= 2019 – 1942.72 = 76.28
Sum of squares between columns (i.e., between drugs):
SSC = Σ(Tj²/nj) – T²/N = (73²/6 + 56²/6 + 58²/6) – 1942.72 = 1971.5 – 1942.72 = 28.77
Sum of squares between rows (i.e., between groups of people):
SSR = Σ(Ti²/ni) – T²/N = (70²/6 + 59²/6 + 58²/6) – 1942.72 = 1957.5 – 1942.72 = 14.78
Sum of squares within samples:
SSW = Σ(Xij – X̄w)², where X̄w = mean within each sample (cell)
= (14 – 14.5)² + (15 – 14.5)² + (10 – 9.5)² + (9 – 9.5)² + (11 – 11)² + (11 – 11)² +
(12 – 11.5)² + (11 – 11.5)² + (7 – 7.5)² + (8 – 7.5)² + (10 – 10.5)² + (11 – 10.5)²
+ (10 – 10.5)² + (11 – 10.5)² + (11 – 11)² + (11 – 11)² + (8 – 7.5)² + (7 – 7.5)²
= 3.50
Sum of squares for interaction variation:
SSI = SST – (SSC + SSR + SSW) = 76.28 – (28.77 + 14.78 + 3.50) = 29.23
ANOVA Table
Source                       SS       df                        MS       F-ratio
Between drugs                28.77    (c – 1) = 2               14.385   14.385/0.389 = 36.9
Between groups of people     14.78    (r – 1) = 2               7.390    7.390/0.389 = 19.0
Interaction                  29.23    (c – 1)(r – 1) = 4        7.308    7.308/0.389 = 18.8
Within samples (error)       3.50     (n – rc) = 18 – 9 = 9     0.389
Total                        76.28    (n – 1) = 17
Table values of F-ratios at the 5% level of significance: F(2, 9) = 4.26; F(4, 9) = 3.63.
i) Calculated F(2, 9) = 36.9 > Table F(2, 9) = 4.26.
Hence, the null hypothesis is rejected, which means the drugs act differently.
ii) Calculated F(2, 9) = 19.0 > Table F(2, 9) = 4.26.
Hence, the null hypothesis is rejected, which means the different groups of people are affected
differently.
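The same table can be reproduced with a standard statistics library; the sketch below fits a two-factor model with interaction and prints the ANOVA table (Python with pandas and statsmodels assumed; typ=2 is fine here because the design is balanced):

```python
# Sketch: reproducing Illustration 8 with statsmodels (pandas/statsmodels assumed).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = {                                   # drug -> (group of people, reading)
    'D1': [(1, 14), (1, 15), (2, 12), (2, 11), (3, 10), (3, 11)],
    'D2': [(1, 10), (1, 9),  (2, 7),  (2, 8),  (3, 11), (3, 11)],
    'D3': [(1, 11), (1, 11), (2, 10), (2, 11), (3, 8),  (3, 7)],
}
rows = [{'drug': d, 'group': g, 'bp': y}
        for d, obs in data.items() for g, y in obs]
df = pd.DataFrame(rows)

model = ols('bp ~ C(drug) * C(group)', data=df).fit()
# Balanced design, so the choice of sum-of-squares type does not matter here.
print(sm.stats.anova_lm(model, typ=2))   # F(drug) ~ 36.9, F(group) ~ 19.0,
                                         # F(interaction) ~ 18.8
```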
Illustration - 9
The following table shows the results obtained for three cars run on four brands of gasoline, with
two trials for each combination. Test whether there are significant differences between the brands
and between the cars (α = 0.05).
Brands of gasoline
W X Y Z
Cars A 13 12 12 11
11 10 11 13
B 12 10 11 9
13 11 12 10
C 14 11 13 10
13 10 14 8
Solution:
Correction factor = T²/N
where T = sum of all observations and N = number of all observations.
W X Y Z Total
A 13 12 12 11 93
11 10 11 13
B 12 10 11 9 88
13 11 12 10
C 14 11 13 10 93
13 10 14 8
Total 76 64 73 61 274
Here,
T1 = 76, T2 = 64, T3 = 73, T4 = 61, T = 274
n1 = 6, n2 = 6, n3 = 6, n4 = 6, N = 24
Correction factor = T²/N = 274²/24 = 3128.17
Total sum of squares:
SST = ΣX²ij – T²/N
= (13² + 11² + 12² + 13² + 14² + 13² + 12² + 10² + 10² + 11² + 11² + 10² + 12² + 11² +
11² + 12² + 13² + 14² + 11² + 13² + 9² + 10² + 10² + 8²) – 3128.17
= 3184 – 3128.17 = 55.83
Sum of squares between columns (i.e., between brands of gasoline):
SSC = Σ(Tj²/nj) – T²/N = (76²/6 + 64²/6 + 73²/6 + 61²/6) – 3128.17
= 3,153.67 – 3,128.17 = 25.50
Sum of squares between rows (i.e., between cars):
SSR = Σ(Ti²/ni) – T²/N = (93²/8 + 88²/8 + 93²/8) – 3128.17 = 3130.25 – 3128.17 = 2.08
Sum of squares within samples:
SSW = Σ(Xij – X̄w)², where X̄w = mean within each cell
= (13 – 12)² + (11 – 12)² + (12 – 12.5)² + (13 – 12.5)² + (14 – 13.5)² + (13 – 13.5)² + (12 – 11)²
+ (10 – 11)² + (10 – 10.5)² + (11 – 10.5)² + (11 – 10.5)² + (10 – 10.5)² + (12 – 11.5)² + (11 –
11.5)² + (11 – 11.5)² + (12 – 11.5)² + (13 – 13.5)² + (14 – 13.5)² + (11 – 12)² + (13 – 12)² +
(9 – 9.5)² + (10 – 9.5)² + (10 – 9)² + (8 – 9)² = 12
Sum of squares for interaction variation:
SSI = SST – (SSC + SSR + SSW) = 55.83 – (25.50 + 2.08 + 12) = 16.25
ANOVA Table
Source                      SS       df                          MS      F-ratio
Between brands (columns)    25.50    (c – 1) = 3                 8.50    8.50/1.00 = 8.50
Between cars (rows)         2.08     (r – 1) = 2                 1.04    1.04/1.00 = 1.04
Interaction                 16.25    (c – 1)(r – 1) = 6          2.71    2.71/1.00 = 2.71
Within samples (error)      12.00    (n – rc) = 24 – 12 = 12     1.00
Total                       55.83    (n – 1) = 23
Table values at the 5% level: F(3, 12) = 3.49; F(2, 12) = 3.89; F(6, 12) = 3.00.
Since 8.50 > 3.49, the difference between brands of gasoline is significant; since 1.04 < 3.89, the
difference between cars is not significant; since 2.71 < 3.00, the interaction is not significant.
4.8 SUMMARY
Multivariate analysis is based on the observation and analysis of more than one statistical outcome
variable at a time. In design and analysis, the technique is used to perform trade studies across multiple
dimensions while taking into account the effects of all variables on the responses of interest.
Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which
involves observation and analysis of more than one statistical outcome variable at a time. In design and
analysis, the technique is used to perform trade studies across multiple dimensions while taking into
account the effects of all variables on the responses of interest.
Multivariate analysis techniques can be conveniently classified into two broad categories,
viz., dependence methods and interdependence methods.
Multiple regression is the most commonly utilized multivariate technique. It examines the relationship
between a single metric dependent variable and two or more metric independent variables.
Discriminant analysis is a regression-based statistical technique used to determine the
classification or group to which an item of data or an object belongs on the basis of its characteristics
or essential features. It differs from group-building techniques such as cluster analysis in that the
classifications or groups to choose from must be known in advance.
Cluster analysis is a collection of statistical methods which identifies groups of samples that behave
similarly or show similar characteristics. In common parlance it is also called look-a-like groups.
Multidimensional scaling is a statistical technique that measures objects in multidimensional space
on the basis of respondents’ judgments of the similarity of objects.
Multivariate analysis of variance (MANOVA) is a statistical technique that provides a simultaneous
significance test of mean difference between groups for two or more dependent variables.
Factor analysis attempts to identify underlying variables, or factors, that explain the pattern of
correlations within a set of observed variables. Factor analysis is often used in data reduction to identify
a small number of factors that explain most of the variance that is observed in a much larger number of
manifest variables. Factor analysis can also be used to generate hypotheses regarding causal mechanisms
or to screen variables for subsequent analysis.
Analysis of Variance (ANOVA) is a collection of statistical models and their associated procedures,
in which the observed variance in a particular variable is partitioned into components attributable to
different sources of variation.