Module-3
1. Frequency Distribution,
2. Cross-Tabulation, and
3. Hypothesis Testing
Frequency Distribution
• In a frequency distribution, one variable is
considered at a time.
• A frequency distribution for a variable produces a
table of frequency counts, percentages, and
cumulative percentages for all the values associated
with that variable.
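The kind of table this produces can be sketched with Python's standard library; the scores below and the missing-value code 9 are illustrative assumptions, not the survey data:

```python
from collections import Counter

def frequency_table(values, missing=None):
    """Build frequency, percentage, and cumulative-percentage columns
    for a single variable, treating the `missing` code as non-valid."""
    counts = Counter(values)
    n_total = len(values)
    n_valid = sum(c for v, c in counts.items() if v != missing)
    rows, cum = [], 0.0
    for value in sorted(v for v in counts if v != missing):
        freq = counts[value]
        pct = 100.0 * freq / n_total        # percentage of all cases
        valid_pct = 100.0 * freq / n_valid  # percentage of valid cases
        cum += valid_pct                    # running (cumulative) percentage
        rows.append((value, freq, round(pct, 1), round(valid_pct, 1), round(cum, 1)))
    return rows

# Illustrative familiarity scores (9 = missing code)
scores = [7, 2, 3, 3, 7, 4, 2, 3, 3, 9]
for row in frequency_table(scores, missing=9):
    print(row)
```

Each row gives the value, its count, its percentage of all cases, its percentage of valid (non-missing) cases, and the cumulative percentage.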
Internet Usage Data
Respondent Number | Sex | Familiarity | Internet Usage | Attitude Toward Internet | Attitude Toward Technology | Usage of Internet: Shopping | Usage of Internet: Banking
1 1.00 7.00 14.00 7.00 6.00 1.00 1.00
2 2.00 2.00 2.00 3.00 3.00 2.00 2.00
3 2.00 3.00 3.00 4.00 3.00 1.00 2.00
4 2.00 3.00 3.00 7.00 5.00 1.00 2.00
5 1.00 7.00 13.00 7.00 7.00 1.00 1.00
6 2.00 4.00 6.00 5.00 4.00 1.00 2.00
7 2.00 2.00 2.00 4.00 5.00 2.00 2.00
8 2.00 3.00 6.00 5.00 4.00 2.00 2.00
9 2.00 3.00 6.00 6.00 4.00 1.00 2.00
10 1.00 9.00 15.00 7.00 6.00 1.00 2.00
11 2.00 4.00 3.00 4.00 3.00 2.00 2.00
12 2.00 5.00 4.00 6.00 4.00 2.00 2.00
13 1.00 6.00 9.00 6.00 5.00 2.00 1.00
14 1.00 6.00 8.00 3.00 2.00 2.00 2.00
15 1.00 6.00 5.00 5.00 4.00 1.00 2.00
16 2.00 4.00 3.00 4.00 3.00 2.00 2.00
17 1.00 6.00 9.00 5.00 3.00 1.00 1.00
18 1.00 4.00 4.00 5.00 4.00 1.00 2.00
19 1.00 7.00 14.00 6.00 6.00 1.00 1.00
20 2.00 6.00 6.00 6.00 4.00 2.00 2.00
21 1.00 6.00 9.00 4.00 2.00 2.00 2.00
22 1.00 5.00 5.00 5.00 4.00 2.00 1.00
23 2.00 3.00 2.00 4.00 2.00 2.00 2.00
24 1.00 7.00 15.00 6.00 6.00 1.00 1.00
25 2.00 6.00 6.00 5.00 3.00 1.00 2.00
26 1.00 6.00 13.00 6.00 6.00 1.00 1.00
27 2.00 5.00 4.00 5.00 5.00 1.00 1.00
28 2.00 4.00 2.00 3.00 2.00 2.00 2.00
29 1.00 4.00 4.00 5.00 3.00 1.00 2.00
30 1.00 3.00 3.00 7.00 5.00 1.00 2.00
Frequency of Familiarity with the Internet
Value Label | Value | Frequency (n) | Percentage | Valid Percentage | Cumulative Percentage
Not so familiar 1 0 0.0 0.0 0.0
2 2 6.7 6.9 6.9
3 6 20.0 20.7 27.6
4 6 20.0 20.7 48.3
5 3 10.0 10.3 58.6
6 8 26.7 27.6 86.2
Very familiar 7 4 13.3 13.8 100.0
Missing 9 1 3.3
TOTAL 30 100.0 100.0
Frequency Histogram
[Figure: bar chart of frequency (0 to 8) against familiarity scores 2 through 7]
Statistics Associated with Frequency Distribution:
Measures of Location
• The mean, or average value, is the most commonly used measure of
central tendency. The mean, X̄, is given by
X̄ = (Σ Xi) / n, with the sum taken over i = 1, …, n
Where,
Xi = Observed values of the variable X
n = Number of observations (sample size)
• The mode is the value that occurs most frequently. It represents the
highest peak of the distribution. The mode is a good measure of
location when the variable is inherently categorical or has otherwise
been grouped into categories.
Statistics Associated with Frequency Distribution:
Measures of Location
• The median of a sample is the middle value when the data are
arranged in ascending or descending order. If the number of data
points is even, the median is usually estimated as the midpoint between
the two middle values – by adding the two middle values and dividing
their sum by 2. The median is the 50th percentile.
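These three measures of location can be computed directly with Python's statistics module; the data values below are hypothetical:

```python
import statistics

data = [2, 3, 3, 4, 4, 4, 5, 6, 7]  # hypothetical scores, already sorted

mean = statistics.mean(data)        # arithmetic average
median = statistics.median(data)    # middle value; the 50th percentile
mode = statistics.mode(data)        # most frequently occurring value

# With an even number of points, the median is the midpoint of the two middle values
even_median = statistics.median([1, 2, 3, 4])  # (2 + 3) / 2 = 2.5
```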
Statistics Associated with Frequency Distribution:
Measures of Variability
• The range measures the spread of the data. It is simply the
difference between the largest and smallest values in the sample.
Range = Xlargest – Xsmallest
• The interquartile range is the difference between the 75th and
25th percentile. For a set of data points arranged in order of
magnitude, the pth percentile is the value that has p% of the data
points below it and (100 - p)% above it.
Statistics Associated with Frequency Distribution:
Measures of Variability
• The variance is the mean squared deviation from the mean. The
variance can never be negative.
• The standard deviation is the square root of the variance.
sx = √[ Σ (Xi − X̄)² / (n − 1) ], with the sum taken over i = 1, …, n
• The coefficient of variation is the ratio of the standard deviation to the
mean expressed as a percentage, and is a unitless measure of relative
variability. CV = (sx / X̄) × 100
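A sketch of these variability measures, again using the statistics module (data values hypothetical):

```python
import statistics

data = [2, 3, 3, 4, 4, 4, 5, 6, 7]  # hypothetical scores

data_range = max(data) - min(data)            # range = largest - smallest
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles ("exclusive" method by default)
iqr = q3 - q1                                  # interquartile range
var = statistics.variance(data)                # sample variance (n - 1 denominator)
sd = statistics.stdev(data)                    # sample standard deviation
cv = 100 * sd / statistics.mean(data)          # coefficient of variation, as a percentage
```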
Statistics Associated with Frequency Distribution:
Measures of Shape
• Skewness is the tendency of the deviations from the mean to be larger
in one direction than in the other. It can be thought of as the tendency
for one tail of the distribution to be heavier than the other.
• Kurtosis is a measure of the relative peakedness or flatness of the curve
defined by the frequency distribution. The kurtosis of a normal
distribution is zero. If the kurtosis is positive, then the distribution is
more peaked than a normal distribution. A negative value means that
the distribution is flatter than a normal distribution.
Skewness of a Distribution
[Figure: (a) a symmetric distribution, in which the mean, median, and mode coincide; (b) a skewed distribution, in which the mean, median, and mode differ]
Cross-Tabulation
• While a frequency distribution describes one variable at a
time, a cross-tabulation describes two or more variables
simultaneously.
• Cross-tabulation results in tables that reflect the joint
distribution of two or more variables with a limited number of
categories or distinct values.
Gender and Internet Usage
Internet Usage | Male | Female | Row Total
Light (1) | 5 | 10 | 15
Heavy (2) | 10 | 5 | 15
Column Total | 15 | 15 | 30
Two Variables Cross-Tabulation
• Since two variables have been cross-classified, percentages could be
computed either columnwise, based on column totals, or rowwise,
based on row totals.
• The general rule is to compute the percentages in the direction of the
independent variable, across the dependent variable. The correct
way of calculating percentages is shown in the tables below.
Internet Usage by Gender
Gender
Internet Usage Male Female
Light 33.3% 66.7%
Heavy 66.7% 33.3%
Column total 100% 100%
Gender by Internet Usage
Internet Usage
Gender Light Heavy Total
Male 33.3% 66.7% 100.0%
Female 66.7% 33.3% 100.0%
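The 2 × 2 table and its row percentages can be sketched in Python; the (gender, usage) pairs below simply reproduce the counts in the table above:

```python
from collections import Counter

# Illustrative pairs of (gender, internet usage) matching the 2 x 2 table above
pairs = [("Male", "Light")] * 5 + [("Male", "Heavy")] * 10 \
      + [("Female", "Light")] * 10 + [("Female", "Heavy")] * 5

counts = Counter(pairs)                      # joint frequencies
row_totals = Counter(g for g, _ in pairs)    # totals per gender

# Percentages computed in the direction of the independent variable (gender)
row_pct = {(g, u): round(100 * c / row_totals[g], 1)
           for (g, u), c in counts.items()}
```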
Steps Involved in Hypothesis Testing
Formulate H0 and H1
Select Appropriate Test
Choose Level of Significance
Collect Data and Calculate Test Statistic
Either: Determine the Probability Associated with the Test Statistic (TSCAL) and Compare it with the Level of Significance, α
Or: Determine the Critical Value of the Test Statistic (TSCR) and Determine whether TSCAL Falls into the (Non)Rejection Region
Reject or Do not Reject H0
Draw Marketing Research Conclusion
A Classification of Statistical Techniques
Statistical Techniques
Parametric Tests (Metric Tests):
  One Sample: t test, Z test
  Two or More Samples:
    Independent Samples: Two-Group t test, Z test
    Paired Samples: Paired t test
Non-parametric Tests (Nonmetric Tests):
  One Sample: Chi-Square, K-S, Runs, Binomial
  Two or More Samples:
    Independent Samples: Chi-Square, Mann-Whitney, Median, K-S
    Paired Samples: Sign, Wilcoxon, McNemar, Chi-Square
Univariate Techniques
A. Parametric Tests
B. Non-Parametric Tests
Univariate Techniques
A. Parametric Tests
One sample:
1. t-test,
2. z-test
B. Non-Parametric Tests
One sample:
1. Chi-square test
2. K-S test
3. Runs test
4. Binomial test
Bivariate Techniques
A. Parametric Tests
1. Independent t-test
2. Paired t- test
3. ANOVA
4. Repeated Measures ANOVA
Bivariate Techniques
B. Non-Parametric Tests
1. Mann-Whitney test
2. Wilcoxon Signed-Rank test
3. Kruskal-Wallis test
4. Friedman ANOVA test
Multivariate techniques
All statistical methods that simultaneously
analyze multiple measurements on each
individual or object under investigation.
Types of Multivariate Techniques
• Dependence techniques: A variable or set of variables is identified as the dependent
variable to be predicted or explained by other variables known as independent
variables.
o Multiple Regression
o Multiple Discriminant Analysis
o Logit/Logistic Regression
o Multivariate Analysis of Variance (MANOVA) and Covariance
o Conjoint Analysis
o Canonical Correlation
o Structural Equations Modeling (SEM)
Types of Multivariate Techniques
• Interdependence techniques: Involve the simultaneous analysis of all variables in
the set, without distinction between dependent variables and independent variables.
– Principal Components and Common Factor Analysis
– Cluster Analysis
– Multidimensional Scaling (perceptual mapping)
– Correspondence Analysis
One Sample t-test
One Sample t-test
Hypothesis Testing:
• When the population standard deviation is unknown, the t-test is required
• Comparison of one sample mean to a specified value
The t-statistic
• This value is used just like a z-statistic: if the value of t exceeds some
threshold or critical value, tcrit, then an effect is detected (i.e., the
hypothesis of no difference is rejected)
• Critical values of t are found in the t-distribution table
Finding Critical Values
A portion of the t- distribution table
The one-sample t-test for a population mean
Step 1: Set the null and alternate hypotheses
The null hypothesis is H0: μ = μ0 (the real mean equals some proposed theoretical
constant μ0);
The alternative hypothesis is one of the following:
H1: μ ≠ μ0 (Two Tailed)   H1: μ < μ0 (Left Tailed)   H1: μ > μ0 (Right Tailed)
Step 2: Determine the appropriate statistical test
Step-3: Set the level of significance
The level of significance, α, is set at 0.05
Step-4: Set the decision rule
Critical values: ±tα/2 (Two Tailed), −tα (Left Tailed), +tα (Right Tailed)
with d.f. = n − 1. If the computed t value falls outside the ±tα/2 range, the
null hypothesis is rejected; otherwise, it is not rejected.
The one-sample t-test for a population mean
Step-5 : Collect the sample data
Step-6: Analyse the data
Step-7: Statistical conclusion and business implication
If the value of the test statistic falls in the rejection
region, reject H0, otherwise do not reject H0.
Criterion for deciding whether or not to
reject the null hypothesis
Examples of t-test
Ex.-1: Apollo Tyres has launched a new brand of tyres for
tractors and claims that under normal circumstances the
average life of the tyres is 40,000 km. A retailer wants to test
this claim and has taken a random sample of 8 tyres. He
tests the life of the tyres under normal circumstances. The
results obtained are presented in the table.
Life of the sample tyres
Tyres 1 2 3 4 5 6 7 8
Km. 35,000 38,000 42,000 41,000 39,000 41,500 43,000 38,500
Solution:
Step-1 : Set the null and alternate hypothesis
H0: μ = 40,000
H1: μ ≠ 40,000
Step-2 : Select the appropriate formula for the t statistic:
t = (X̄ − μ0) / (s/√n), where s is the sample standard deviation
Step-3: Set the level of significance
The confidence level is taken as 95% ( 𝛼 = 0.05 ).
Step-4: Set the decision rule
Critical values: ±tα/2 (Two Tailed), −tα (Left Tailed), +tα (Right Tailed)
with d.f. = n − 1. If the computed t value falls outside the ±tα/2 range, the
null hypothesis is rejected; otherwise, it is not rejected.
Step-5 : Collect the sample data
Life of Tyres
Tyres | Km. | (Xi − X̄) | (Xi − X̄)²
1 | 35,000 | −4,750 | 22,562,500
2 | 38,000 | −1,750 | 3,062,500
3 | 42,000 | 2,250 | 5,062,500
4 | 41,000 | 1,250 | 1,562,500
5 | 39,000 | −750 | 562,500
6 | 41,500 | 1,750 | 3,062,500
7 | 43,000 | 3,250 | 10,562,500
8 | 38,500 | −1,250 | 1,562,500
Mean = 39,750 | Total = 48,000,000
Sample standard deviation, s = √(48,000,000 / 7) = 2,618.61
Sample mean, X̄ = 39,750
Population mean, μ0 = 40,000
Standard error of the sample mean, s/√n = 2,618.61/√8 = 925.82
Step-6: Analyse the data
t = (39,750 − 40,000) / 925.82
= −250 / 925.82
t = −0.27
The degrees of freedom for the t statistic to test the hypothesis about one mean
are n − 1. In this case, n − 1 = 8 − 1 = 7. From the t-distribution table,
the critical value is 2.365. The observed t value of −0.27 falls in the
acceptance region. This implies that the evidence from the sample is not
sufficient to reject the null hypothesis that the population mean is 40,000 km.
Step -7 : Statistical conclusion and business implication
Hence, the null hypothesis is not rejected. The retailer can quite
convincingly tell customers that the company's claim holds under
normal conditions.
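As a sketch, the tyre example can be checked with Python's standard library (recomputing from the raw data gives a standard error of about 925.8 and t ≈ −0.27):

```python
import math
import statistics

tyres = [35000, 38000, 42000, 41000, 39000, 41500, 43000, 38500]
mu0 = 40000                      # claimed mean life, km

n = len(tyres)
xbar = statistics.mean(tyres)    # sample mean = 39,750
s = statistics.stdev(tyres)      # sample standard deviation (n - 1 denominator)
se = s / math.sqrt(n)            # standard error of the mean
t = (xbar - mu0) / se            # one-sample t statistic, d.f. = n - 1

t_crit = 2.365                   # two-tailed critical value for alpha = .05, d.f. = 7
reject = abs(t) > t_crit         # False: do not reject H0
```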
Two Sample t-test
Two Sample t-test: Hypothesis of Differences
Between Two Groups
1. Is Group “A” Different Than Group “B”?
2. Does an Experimental Manipulation Have an Effect?
• Is an experimental group different than a control group?
• If so, then the experimental manipulation had an effect
The Basic Idea…
The basic t-test has the form:
t = (Observed mean − Hypothesized mean) / Standard error
The t-test for two population means
Step-1 : Set the null and alternate hypothesis
The null hypothesis is H0: μ1 = μ2, or μ1 − μ2 = 0;
The alternative hypothesis is one of the following:
H1: μ1 ≠ μ2 (Two Tailed)   H1: μ1 < μ2 (Left Tailed)   H1: μ1 > μ2 (Right Tailed)
Step-2 : Select the appropriate formula for the t statistic
Step-3: Set the level of significance
The confidence level is taken as 95% ( 𝛼 = 0.05 ).
Step-4: Set the decision rule
Critical values: ±tα/2 (Two Tailed), −tα (Left Tailed), +tα (Right Tailed)
with d.f. = n1 + n2 − 2. If the computed t value falls outside the ±tα/2 range,
the null hypothesis is rejected; otherwise, it is not rejected.
The t-test for two population means
Step-5 : Collect the sample data
Pooled standard deviation,
s = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ]
Step-6: Analyse the data
Step-7: Statistical conclusion and business implication
If the value of the test statistic falls in the rejection
region, reject H0, otherwise do not reject H0.
Examples
Process for comparing two population means using
independent samples
Population-1 (Faculty in Public Institutions) → Sample-1 → Compute X̄1
Population-2 (Faculty in Private Institutions) → Sample-2 → Compute X̄2
Compare X̄1 and X̄2 based on the pooled variance
Make decision
Example-2:
A market is controlled by two companies, A and B. Company A is concerned that a sizeable
number of its customers may shift to company B because of an aggressive advertisement
campaign launched by it. In order to assess the anticipated brand shift, the researchers at
company A have prepared a questionnaire to measure customer satisfaction and have
administered it to customers. The questionnaire consisted of 10 questions on a five-point rating
scale, with 1 rated as "strongly disagree" and 5 rated as "strongly agree". The questionnaire
has been administered to 10 randomly selected customers of company A and 12 randomly
selected customers of company B. The scores obtained from these customers are given in the
following table. Test whether there is a difference in the mean scores obtained from customers in
the population. Assume equal variance in the population.
Customers 1 2 3 4 5 6 7 8 9 10 11 12
Company-A 40 42 39 38 41 37 38 39 40 41 … …
Company-B 30 31 32 34 35 32 30 34 35 36 32 31
Solution:
Step-1 : Set the null and alternate hypothesis
H0: μ1 = μ2
H1: μ1 ≠ μ2
Step-2: Select the appropriate formula for the t statistic
t = [(X̄1 − X̄2) − (μ1 − μ2)] / s(X̄1 − X̄2)
where s(X̄1 − X̄2) = √[ s² (1/n1 + 1/n2) ]
and s², the pooled variance, can be estimated by pooling the two sample variances and
computing a pooled standard deviation as
s = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ]
Step-3: Set the level of significance
The confidence level is taken as 95% ( 𝛼 = 0.05 ).
Step-4: Set the decision rule
For α = 0.05 and degrees of freedom 10 + 12 − 2 = 20, the critical values of t from the
distribution table are +2.086 and −2.086. The null hypothesis will be rejected if the
observed value of t falls outside +2.086 and −2.086.
Step-5 : Collect the sample data
Company-A vs. Company-B
Customer No. | Company-A | (Xi − X̄1) | (Xi − X̄1)² || Customer No. | Company-B | (Xi − X̄2) | (Xi − X̄2)²
1 | 40 | 0.5 | 0.25 || 1 | 30 | −2.66 | 7.07
2 | 42 | 2.5 | 6.25 || 2 | 31 | −1.66 | 2.76
3 | 39 | −0.5 | 0.25 || 3 | 32 | −0.66 | 0.44
4 | 38 | −1.5 | 2.25 || 4 | 34 | 1.34 | 1.80
5 | 41 | 1.5 | 2.25 || 5 | 35 | 2.34 | 5.48
6 | 37 | −2.5 | 6.25 || 6 | 32 | −0.66 | 0.44
7 | 38 | −1.5 | 2.25 || 7 | 30 | −2.66 | 7.07
8 | 39 | −0.5 | 0.25 || 8 | 34 | 1.34 | 1.80
9 | 40 | 0.5 | 0.25 || 9 | 35 | 2.34 | 5.48
10 | 41 | 1.5 | 2.25 || 10 | 36 | 3.34 | 11.16
 |  |  |  || 11 | 32 | −0.66 | 0.44
 |  |  |  || 12 | 31 | −1.66 | 2.76
X̄1 = 39.5 | Σ = 22.5 || X̄2 = 32.67 | Σ ≈ 46.67
First Sample (Company A)
Sample mean = 39.5
Sample size = 10
Sample variance = 2.5
Second Sample (Company B)
Sample mean = 32.6666
Sample size = 12
Sample variance = 4.242
Sample variance for company A,
s1² = 22.5/9 = 2.5
Sample variance for company B,
s2² = 46.67/11 = 4.242
Pooled standard deviation
s = √[ (22.5 + 46.67) / (10 + 12 − 2) ] = √3.458
= 1.859
Step-6: Analyse the data
s(X̄1 − X̄2) = √[ s² (1/n1 + 1/n2) ] = √[ 3.458 × (1/10 + 1/12) ]
= 0.796
t = [(X̄1 − X̄2) − (μ1 − μ2)] / s(X̄1 − X̄2)
= [(39.5 − 32.67) − 0] / 0.796
t = 8.58
Step-7: Interpretation
The computed value of the t statistic (8.58) is greater than the
critical value of the t statistic (+2.086). Hence, the null hypothesis
is rejected and the alternative hypothesis is accepted.
Company A can be 95% confident that the aggressive advertisement
campaign launched by company B has not eroded the satisfaction
of its own customers. In fact, the sample clearly indicates (at the 95%
confidence level) that customer satisfaction is higher for company A
than for company B.
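The pooled-variance calculation above can be reproduced as a short Python sketch:

```python
import math
import statistics

a = [40, 42, 39, 38, 41, 37, 38, 39, 40, 41]          # company A scores
b = [30, 31, 32, 34, 35, 32, 30, 34, 35, 36, 32, 31]  # company B scores

n1, n2 = len(a), len(b)
var1, var2 = statistics.variance(a), statistics.variance(b)

# Pooled variance: assumes equal population variances
s2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
se = math.sqrt(s2 * (1 / n1 + 1 / n2))                # standard error of the difference
t = (statistics.mean(a) - statistics.mean(b)) / se    # d.f. = n1 + n2 - 2

t_crit = 2.086                                        # two-tailed, alpha = .05, d.f. = 20
reject = abs(t) > t_crit                              # True: the means differ
```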
Paired t-test
Before-After design: Pre-Measure → Manipulation → Post-Measure
Matched Subject Design
For any study the two groups of subjects could be closely
matched
1. Age
2. IQ
3. Income
4. Education Level
Ex.-1
An electronic goods company arranged a special training programme for one
segment of its employees. The company wants to measure the change in the
attitude of its employees after the training. For this purpose, it has used a
well-designed questionnaire, which consists of 10 questions on a 1-to-5 rating
scale (1 is strongly disagree and 5 is strongly agree). The company selected a
random sample of 10 employees. The scores obtained by these employees are
given in the table.
Employees 1 2 3 4 5 6 7 8 9 10
Score before training 25 26 28 22 20 30 22 20 21 24
Score after training 32 30 32 34 32 28 25 30 25 28
Solution:
Step-1 : Set the null and alternate hypothesis
H0: μd = 0 (no change in attitude); H1: μd ≠ 0
Step-2 : Select the appropriate formula for the t statistic:
t = d̄ / (sd/√n), where d̄ is the mean difference and sd its standard deviation
Step-3: Set the level of significance
The confidence level is taken as 95% ( 𝛼 = 0.05 ).
Step-4: Set the decision rule
Critical values: ±tα/2 (Two Tailed), −tα (Left Tailed), +tα (Right Tailed)
The value of α is .05 and the degrees of freedom are 9. The tabular values are
+2.262 and −2.262. The null hypothesis will be rejected if the observed
value of t is less than −2.262 or greater than +2.262.
Step-5 : Collect the sample data
Employees | Before training | After training | Diff. (d) | (d − d̄)²
1 | 25 | 32 | −7 | 1.44
2 | 26 | 30 | −4 | 3.24
3 | 28 | 32 | −4 | 3.24
4 | 22 | 34 | −12 | 38.44
5 | 20 | 32 | −12 | 38.44
6 | 30 | 28 | 2 | 60.84
7 | 22 | 25 | −3 | 7.84
8 | 20 | 30 | −10 | 17.64
9 | 21 | 25 | −4 | 3.24
10 | 24 | 28 | −4 | 3.24
d̄ = −58/10 = −5.8
sd = √(177.6/9) = 4.442
sd/√n = 4.442/√10 = 1.404
Step-6: Analyse the data
t = d̄ / (sd/√n) = −5.8/1.404 = −4.13
Step-7: Statistical conclusion and business implication
The observed t value of −4.13 is less than the tabular t value of −2.262. Hence,
the null hypothesis is rejected and the alternative hypothesis is accepted.
Therefore, it can be concluded that there is a significant difference in the
attitude of employees before and after the training. The special training
programme organized by the company has significantly changed the
attitude of the employees. Hence, the company should organize this special
training programme for all its employees.
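The paired calculation can be sketched in Python using the training-score data:

```python
import math
import statistics

before = [25, 26, 28, 22, 20, 30, 22, 20, 21, 24]
after = [32, 30, 32, 34, 32, 28, 25, 30, 25, 28]

d = [x - y for x, y in zip(before, after)]  # paired differences (before - after)
n = len(d)
dbar = statistics.mean(d)                   # mean difference = -5.8
sd = statistics.stdev(d)                    # standard deviation of the differences
t = dbar / (sd / math.sqrt(n))              # paired t statistic, d.f. = n - 1

t_crit = 2.262                              # two-tailed, alpha = .05, d.f. = 9
reject = abs(t) > t_crit                    # True: training changed attitudes
```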
Non-Parametric Test
Nonparametric tests: features
• Nonparametric statistical tests can be used when the data being
analysed do not follow a normal distribution
• Many nonparametric methods do not use the raw data and
instead use the rank order of data for analysis
• Nonparametric methods can be used with small samples
Non-Parametric Tests
One sample:
1. Chi-square test
2. K-S test
3. Runs test
4. Binomial test
Chi-square test
One-way Chi-Square Test (χ²)
1. Used when your dependent variable is counts within categories
( Pepsi lovers, Coke lovers, Sprite lovers)
2. Used when your DV has two or more mutually exclusive categories
3. Compares the counts you got in your sample to those you would
expect under the null hypothesis
4. Also called the Chi-Square “Goodness of Fit” test.
Chi-Square Test for Goodness of Fit
• One sample, DV is at Nominal/Ordinal Level of Measurement
• This test, the chi-square goodness of fit, determines whether the sample
distribution fits some theoretical distribution
• Observed frequency : number of individuals from the sample who
are classified in a particular category
• Expected frequency: the frequency for a particular category that is
predicted from the null hypothesis
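A minimal sketch of the goodness-of-fit statistic; the brand-preference counts are hypothetical:

```python
def chi_square_gof(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: 100 shoppers; H0 says the three brands are equally preferred
observed = [45, 35, 20]
expected = [100 / 3] * 3

chi2 = chi_square_gof(observed, expected)
# Compare with the critical chi-square value for d.f. = k - 1 = 2 (5.991 at alpha = .05)
```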
Kolmogorov-Smirnov test
Kolmogorov-Smirnov test
• The Kolmogorov-Smirnov test (K–S test) is a form of minimum distance
estimation used as a non-parametric test of equality of one-dimensional
probability distributions used to compare a sample with a reference
probability distribution (one-sample K–S test), or to compare two samples
(two-sample K–S test).
• The Kolmogorov–Smirnov statistic quantifies a distance between the
empirical distribution of the sample and the cumulative distribution
function of the reference distribution, or between the empirical distribution
functions of two samples.
• The null distribution of this statistic is calculated under the null hypothesis
that the samples are drawn from the same distribution (in the two-sample
case) or that the sample is drawn from the reference distribution (in the
one-sample case).
• In each case, the distributions considered under the null hypothesis are
continuous distributions but are otherwise unrestricted.
• Compare an empirical distribution function with the distribution function of
the hypothesized distribution.
Advantages
1. K-S tests do not require grouping the data in any way, so no information
is lost; this also eliminates the troublesome problem of interval
specification.
2. K-S tests are valid (exactly) for any sample size n (in the all-
parameters-known case), whereas Chi-Square tests are valid only in
an asymptotic sense.
3. K-S tests tend to be more powerful than Chi-Square tests against
many alternative distributions.
Runs -test
Runs -test
• Perform a runs test for randomness
• Runs tests are used to test whether it is reasonable to
conclude data occur randomly, not whether the data
are collected randomly.
Runs -test
• Runs test for randomness – used to test claims that data have
been obtained or occur randomly
• Run – sequence of similar events, items, or symbols that is
followed by an event, item, or symbol that is mutually
exclusive from the first event, item, or symbol
• Length – number of events, items, or symbols in a run
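Counting runs is straightforward; the coin-flip sequence below is hypothetical:

```python
def count_runs(sequence):
    """Count runs: maximal stretches of identical consecutive symbols.
    Assumes a non-empty sequence."""
    runs = 1
    for prev, cur in zip(sequence, sequence[1:]):
        if cur != prev:       # a new run starts whenever the symbol changes
            runs += 1
    return runs

flips = "HHHTTHTTTH"           # hypothetical sequence: HHH | TT | H | TTT | H
n_runs = count_runs(flips)     # 5 runs
```

The observed number of runs is then compared with the number expected under randomness (too few runs suggests clustering, too many suggests alternation).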
Binomial test
Bernoulli Process Definition
Properties of a Bernoulli (binomial) Process:
• There are two possible outcomes for each trial.
• The probability of the outcome remains constant over
time.
• The outcomes of the trials are independent.
• The number of trials is discrete and integer.
Use the above as a check-list to determine if a given process is
binomial.
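An exact binomial tail probability can be computed with math.comb; the 8-successes-in-10-trials example is hypothetical:

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): exact upper-tail probability."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical check: 8 successes in 10 trials under H0: p = 0.5
p_one_tail = binom_tail(10, 8)   # one-tailed p-value = 56/1024
```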
Non-Parametric Tests
Two samples: Independent
1. Chi-square test
2. Mann-Whitney test
3. Median test
4. K-S test
Chi-Square Test For Independence Of Attributes
Chi-Square Test
1. Used to test whether two nominal variables are independent or
related
2. E.g. Is gender related to socio-economic class?
3. Compares the observed frequencies to the frequencies expected
if the variables were independent
4. Often called a chi-squared test of independence
5. Fundamentally testing, “do these variables interact”?
Mann-Whitney test
Mann-Whitney U test
• This is the nonparametric equivalent of the unpaired t-test
• It is applied when there are two independent samples randomly drawn
from the population e.g. diabetic patients versus non-diabetics .
• The data has to be ordinal i.e. data that can be ranked (put into order
from highest to lowest )
• It is recommended that the sample sizes should be >5 and <20 (for larger
samples, use the normal approximation or statistical packages)
• The samples in the two groups need not be of equal size
Uses of Mann-Whitney U test
Mainly used to analyse the difference between the
medians of two data sets.
We want to know whether two sets of measurements
genuinely differ.
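A simple sketch of the U statistic via rank sums (assumes no tied values; the scores are hypothetical):

```python
def mann_whitney_u(sample1, sample2):
    """Mann-Whitney U via rank sums (simple version: assumes no ties)."""
    combined = sorted(sample1 + sample2)
    ranks = {value: i + 1 for i, value in enumerate(combined)}  # rank 1 = smallest
    r1 = sum(ranks[v] for v in sample1)                         # rank sum of sample 1
    n1, n2 = len(sample1), len(sample2)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 - u1
    return min(u1, u2)   # test statistic: compare with the tabled critical value

group1 = [12, 15, 11]    # hypothetical rankable scores
group2 = [18, 20, 17]
u = mann_whitney_u(group1, group2)
```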
Median Test
Median Test
• The median is the “middle” observation when the complete list of
observations is sorted in order.
• When there is an odd number of observations, the value of the middle
one is the median.
• When there is an even number of observations, the average
of the two "middle" observations is used as the median.
• The median may be a better indication of the center of a group of
numbers if there are some values that are considerably more extreme
than others.
Kolmogorov-Smirnov test
Kolmogorov-Smirnov test two samples test
The two-sample K-S statistic is D = max |F1(x) − F2(x)|, the maximum absolute
difference between the two empirical distribution functions; the critical values
differ for the two-tailed and one-tailed tests.
Kruskal-Wallis Test
The Kruskal-Wallis Test
• The Kruskal-Wallis Test is the alternative to one-way ANOVA.
• The Kruskal-Wallis Test is based on the assumption of independence of
groups.
• The Kruskal-Wallis Test can be performed on ordinal data and is not based
on normality assumption of population.
• The Kruskal-Wallis Test is a nonparametric procedure that can be used to
compare more than two populations in a completely randomized design.
• All n = n1+n2+…+nk measurements are jointly ranked (i.e. treat as one
large sample).
• We use the sums of the ranks of the k samples to compare the distributions.
Friedman test
Friedman test
This test has following assumptions:
1. The blocks are independent.
2. There is no interaction between blocks and treatments.
3. Observations within each block can be ranked.
Sign Test
Sign Test
• Sign test – tests whether the numbers of differences (+ve or −ve) between
two samples are approximately the same. Each pair of scores (before and
after) is compared.
• When "after" > "before", a plus sign is recorded; when "after" < "before", a
minus sign. When both are the same, it is a tie.
• The sign test does not use all the information available (the size of the
difference), but it requires fewer assumptions about the sample and can avoid
the influence of outliers.
Sign Test
• A common application of the sign test involves using a sample of n
potential customers to identify a preference for one of two brands of a
product.
• The objective is to determine whether there is a difference in
preference between the two items being compared.
• To record the preference data, we use a plus sign if the individual
prefers one brand and a minus sign if the individual prefers the other
brand.
• Because the data are recorded as plus and minus signs, this test is
called the sign test.
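A sketch of the exact sign test, built on the binomial distribution (the 8-plus/2-minus split is hypothetical):

```python
from math import comb

def sign_test_p(plus, minus):
    """Two-tailed exact p-value for the sign test (ties already dropped).
    Under H0, plus and minus signs are equally likely (p = 0.5)."""
    n = plus + minus
    k = max(plus, minus)
    tail = sum(comb(n, i) for i in range(k, n + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)

# Hypothetical brand-preference data: 8 plus signs, 2 minus signs
p_value = sign_test_p(8, 2)   # 2 * 56/1024 = 0.109375: not significant at alpha = .05
```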
Wilcoxon Signed-Rank Test
Wilcoxon Signed-Rank Test
• The Wilcoxon signed-rank test applies to the case of symmetric
continuous distributions.
• Under this assumption, the mean equals the median.
• The null hypothesis is H0: μ = μ0
Wilcoxon Matched-Pairs Signed Rank Test
• A nonparametric alternative to the t test for related samples
• Before and After studies
• Studies in which measures are taken on the same person or
object under different conditions
• Studies of twins or other relatives
Wilcoxon Matched-Pairs Signed Rank Test
• Differences of the scores of the two matched samples
• Differences are ranked, ignoring the sign
• Ranks are given the sign of the difference
• Positive ranks are summed
• Negative ranks are summed
• T is the smaller sum of ranks
Computing the Wilcoxon matched-pairs test
• First, a difference score is calculated for each case by subtracting the
actual score for the first variable in the pair from the actual score for the
second variable in the pair. If the difference score is zero because the
actual scores are the same, the case is dropped from the analysis.
• Second, the difference scores are rank ordered from low to high by size,
ignoring the sign of the difference scores (i.e., using absolute values).
• Third, the ranks associated with the negative difference scores are
summed and the ranks associated with the positive difference scores are
summed.
• Fourth, the probability of the Wilcoxon match-pairs signed ranks test
statistic is computed, using the smaller of the total summed ranks in the
formula.
• Fifth, since the test is always done with the smaller of the summed ranks,
we must link the relationship in the research question to the pattern of
summed ranks to make certain that our finding is consistent with the
relationship we wanted to test.
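The steps above can be sketched as follows (assumes no tied absolute differences; the difference scores are hypothetical):

```python
def wilcoxon_t(differences):
    """Wilcoxon matched-pairs T: the smaller of the signed rank sums.
    Simple version: drops zeros, assumes no ties in absolute value."""
    d = [x for x in differences if x != 0]         # step 1: drop zero differences
    ranked = sorted(d, key=abs)                    # step 2: order by absolute size
    pos = sum(i + 1 for i, x in enumerate(ranked) if x > 0)  # sum of + ranks
    neg = sum(i + 1 for i, x in enumerate(ranked) if x < 0)  # sum of - ranks
    return min(pos, neg)                           # T = smaller summed rank

t_stat = wilcoxon_t([-2, 4, -6, 8, 10, 0])         # hypothetical difference scores
```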
McNemar’s test
McNemar’s test
• The Mcnemar test may be used with either nominal or ordinal data
and is especially useful with before-after measurement of the same
subjects.
• McNemar – tests whether the changes in proportions are the same
for pairs of dichotomous variables.
• McNemar’s test is computed like the usual chi-square test, but only
the two cells in which the classification don’t match are used.
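A minimal sketch using only the two discordant cells (the cell counts are hypothetical):

```python
def mcnemar_chi2(b, c):
    """McNemar chi-square from the two discordant cells b and c
    (the pairs whose before/after classifications do not match)."""
    return (b - c) ** 2 / (b + c)

# Hypothetical before-after table: 10 subjects switched one way, 4 the other
chi2 = mcnemar_chi2(10, 4)
# Compare with the chi-square critical value for d.f. = 1 (3.841 at alpha = .05)
```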
Analysis of Variance
• What is ANOVA?
• When is it useful?
• How does it work?
• Some Examples
Introduction
• Analysis of variance compares two or more
populations of interval data.
• Specifically, we are interested in determining whether
differences exist between the population means.
• The procedure works by analyzing the sample
variance.
Variables
1. Dependent variable: Metric data (Continuous)
Ex. Sales, Profit, Market share Age, Height etc.
2. Independent variable: Non-metric data(categorical)
Ex. Sales promotion(High, Medium and Low),
Region ( Delhi, Lucknow and Bhopal)
Questionnaire
• Metric data: Ex. "The sales of this organization are ……",
"I am satisfied with the quality of the product"
1………...2………..3……...4……………5
Strongly disagree Disagree Neutral Agree Strongly agree
• Non-metric data: Ex. In-store promotion-
a) High,
b) Medium
c) Low
Data
• Any data set has variability
• Variability exists within groups…
• and between groups
Notation
• k = Number of groups
• n = Number of observations in each group
• Yi = Individual observation
• Yij = ith observation in the jth category
• SS = Sum of Squares
• MS = Mean of Squares
• F = MSbetween/ MSerror
Calculation of SS Values
Calculating Mean Square Values & F-Ratio
• MS = SS/df
• MSbetween = SSbetween / (k − 1)
• MSerror = SSerror / (n − k)
• F-Ratio = MSbetween / MSerror
Total sum of squares
• SSy = SSbetween + SSwithin
or
• SSy = SSx + SSerror
Where
SSy = The total variation in Y
SSbetween = Variation between the categories of X
SSwithin = Variation in Y related to the variation within each
category of X
Hypothesis Testing & Significance Levels
F-Ratio = MSbetween / MSerror
• If:
– the ratio of Between-Groups MS to Within-Groups MS is
LARGE → reject H0: there is a difference between
groups
– the ratio of Between-Groups MS to Within-Groups MS is
SMALL → do not reject H0: there is no difference between
groups
p-values
• Use table in stats book to determine p
• Use degree of freedom for numerator and
denominator
• Choose level of significance
• If Fcalculated > Fcritical value, reject the null hypothesis
(for one-tail test)
One Way Analysis of Variance
• Example
– An experiment in which a major department store chain wanted to
examine the effect of the level of in-store promotion.
– The experimenter wants to know the impact of in-store
promotion on sales.
– In store promotion was varied at three levels:
1. High promotion.
2. Medium promotion.
3. Low promotion.
Data…..
Metric data: Sales (continuous variable)
A 1-to-10 scale is used; higher numbers denote higher sales.
1…….2…….3……4…….5……..6……7…..8……..9……10
Low High
Non-metric data: In-store promotion (categorical variable)
High promotion-1
Medium promotion-2
Low promotion-3
Effect of In-store promotion on sales
Store No. Level of In- store promotion
High Medium Low
1 10 8 5
2 9 8 7
3 10 7 6
4 8 9 4
5 9 6 5
6 8 4 2
7 9 5 3
8 7 5 2
9 7 6 1
10 6 4 2
Column total 83 62 37
Category means: 83/10= 8.3 62/10= 6.2 37/10= 3.7
Grand mean = (83 + 62 + 37)/30 = 6.067
• Suppose that only one factor, namely in-store
promotion, was manipulated. The department store is
attempting to determine the effect of in- store
promotion(X) on sales(Y).
• The null hypothesis is that the category means
are equal:
H0: μ1 = μ2 = μ3
H1: At least two means differ
To test the null hypothesis, the various sums of
squares are computed:
ANOVA Calculations: Sums of Squares
SSy =
(10 - 6.067)2 + (09 - 6.067)2 +(10 - 6.067)2 + (08 - 6.067)2 + (09 - 6.067)2 +
(08 - 6.067)2 + (09 - 6.067)2 + (07 - 6.067)2 + (07 - 6.067)2+ (06 -6.067)2 +
(08 - 6.067)2 + (08 - 6.067)2 + (07 - 6.067)2 + (09 - 6.067)2 + (06 -6.067)2 +
(04 - 6.067)2 + (05 - 6.067)2 + (05 - 6.067)2 + (06 - 6.067)2 + (04 -6.067)2 +
(05 - 6.067)2 + (07 - 6.067)2 + (06 - 6.067)2 + (04 - 6.067)2 + (05 -6.067)2 +
(2 - 6.067)2 + (3 - 6.067)2 + (2 - 6.067)2 + (1 - 6.067)2 + (2 - 6.067)2
Continued….
SSy =
(3.933)2 + (2.933)2 + (3.933)2 + (1.933)2 + (2.933)2 + (1.933)2 +
(2.933)2 + (0.933)2 + (0.933)2 + (-0.067)2 + (1.933)2 + (1.933)2 +
(0.933)2 + (2.933)2 + (-0.067)2 + (-2.067)2 + (-1.067)2 + (-1.067)2 +
(-0.067)2 + (-2.067)2 + (-1.067)2 + (0.933)2 + (-0.067)2 + (-2.067)2 +
(-1.067)2 +(-4.067)2 +(-3.067)2 +(-4.067)2 +(-5.067)2 +(-4.067)2
= 185.867
Sum of Square between the categories
SSx = 10(8.3 - 6.067)2 + 10(6.2 - 6.067)2+ 10(3.7 - 6.067)2
= 10 (2.233)2 + 10 (0.133)2 +10 (-2.367)2
= 106.067
Sum of square within the categories
SSerror =
(10-8.3)2 + (09-8.3)2 + (10-8.3)2 + (08-8.3)2 + (09-8.3)2 + (08-8.3)2 +
(09-8.3)2+ (07-8.3)2 + (07-8.3)2 + (06-8.3)2 + (08-6.2)2 + (08-6.2)2 +
(07-6.2)2+ (09-6.2)2 + (06-6.2)2 + (04-6.2)2 + (05-6.2)2 + (05-6.2)2 +
(06-6.2)2+(04-6.2)2+ (05-3.7)2 + (07-3.7)2 + (06-3.7)2 + (04-3.7)2 +
(05-3.7)2 +(02-3.7)2 +(03-3.7)2 +(02-3.7)2 +(01-3.7)2 +(02-3.7)2
Continued….
SSerror =
(1.7)2 + (0.7)2 + (1.7)2 + (-0.3)2 + (0.7)2 + (-0.3)2 + (0.7)2 + (-1.3)2
+ (-1.3)2 + (-2.3)2 + (1.8)2 + (1.8)2 + (0.8)2 + (2.8)2 + (-0.2)2 + (-2.2)2
+ (-1.2)2 + (-1.2)2 + (-0.2)2 + (-2.2)2 + (1.3)2 + (3.3)2 + (2.3)2 + (0.3)2
+ (1.3)2 + (-1.7)2 + (-0.7)2 + (-1.7)2 + (-2.7)2 + (-1.7)2
= 79.80
It can be verified that
SSy = SSx + SSerror
as follows: 185.867 = 106.067 + 79.800
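The sums-of-squares computation above can be reproduced in a short script. This is a minimal sketch using only the standard library; the variable names are our own, and the raw sales figures come from the table earlier in this section.

```python
# One-way ANOVA sums of squares for the in-store promotion example,
# computed from the raw sales data in the table above.
high   = [10, 9, 10, 8, 9, 8, 9, 7, 7, 6]
medium = [8, 8, 7, 9, 6, 4, 5, 5, 6, 4]
low    = [5, 7, 6, 4, 5, 2, 3, 2, 1, 2]

groups = [high, medium, low]
all_obs = [y for g in groups for y in g]
grand_mean = sum(all_obs) / len(all_obs)              # 6.067

ss_y = sum((y - grand_mean) ** 2 for y in all_obs)    # total variation
ss_x = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
           for g in groups)                           # between categories
ss_error = sum((y - sum(g) / len(g)) ** 2
               for g in groups for y in g)            # within categories

print(round(ss_y, 3), round(ss_x, 3), round(ss_error, 3))
# → 185.867 106.067 79.8
```

Note that the decomposition SSy = SSx + SSerror falls out automatically: the script never uses it, yet the three printed values satisfy it.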
The strength of the effect of X on Y is measured as follows:
η² = SSx / SSy
= 106.067/185.867
= 0.571
57.1 percent of the variation in sales (Y) is accounted for by
in-store promotion (X), indicating a modest effect.
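The eta-squared figure can be checked directly from the two sums of squares (a minimal sketch; the values are those derived above):

```python
# Eta-squared: proportion of total variation in sales (Y) explained
# by in-store promotion (X), from the sums of squares derived above.
ss_x, ss_y = 106.067, 185.867
eta_sq = ss_x / ss_y
print(round(eta_sq, 3))  # → 0.571
```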
ANOVA Calculations: Mean Squares & Fcalc
MSx = SSx / (k-1)
= 106.067/2 = 53.033
MSerror = SSerror /(N-k)
= 79.800/27 = 2.956
Fcalc = MSx / MSerror
= 17.944
Fcrit = F(α; k-1, N-k) = F(.05; 2, 27) = 3.35
F-table portion with α = .05
(df1 = numerator degrees of freedom, df2 = denominator degrees of freedom)

df2\df1     1        2        3        4        5        6     ....    60
1        161.4    199.5    215.7    224.6    230.2    234.0    ....  252.2
2        18.51    19.00    19.16    19.25    19.30    19.33    ....  19.48
3        10.13     9.55     9.28     9.12     9.01     8.94    ....   8.57
.          .        .        .        .        .        .      ....    .
20        4.35     3.49     3.10     2.87     2.71     2.60    ....   1.95
.          .        .        .        .        .        .      ....    .
27          .      3.35      .        .        .        .      ....    .
30        4.17     3.32     2.92     2.69     2.53     2.42    ....   1.74
40        4.08     3.23     2.84     2.61     2.45     2.34    ....   1.64
One-way ANOVA: Effect of In-store Promotion on Store Sales

Source of Variation                    df      SS        MS       Fcalc    Fcrit
Between groups (In-store promotion)     2   106.067   53.033    17.944    3.35
Within groups (Error)                  27    79.800    2.956
Total                                  29   185.867

Since Fcalc > Fcrit, there is strong evidence of a difference
between the group means.
Conclusion……
For 2 and 27 degrees of freedom, the critical value of F at
α = 0.05 is 3.35. Because the calculated value of F (17.944)
exceeds this critical value, we reject the null hypothesis and
conclude that the population means for the three levels of
in-store promotion are indeed different. The relative magnitudes
of the three category means indicate that a high level of
in-store promotion leads to significantly higher sales.
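The mean squares and the F test follow mechanically from the sums of squares. The sketch below hard-codes the values derived above and the tabled critical value:

```python
# F test for the one-way ANOVA: mean squares, F statistic, decision.
ss_x, ss_error = 106.067, 79.800
k, n = 3, 30                     # 3 promotion levels, 30 stores

ms_x = ss_x / (k - 1)            # 53.033
ms_error = ss_error / (n - k)    # 2.956
f_calc = ms_x / ms_error

f_crit = 3.35                    # F(.05; 2, 27) from the table
print(round(f_calc, 3), f_calc > f_crit)  # → 17.944 True
```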
Regression
Regression analysis
Regression analysis examines associative relationships between a
metric dependent variable and one or more independent variables in
the following ways:
• Determine whether the independent variables explain a significant
variation in the dependent variable: whether a relationship exists.
• Determine how much of the variation in the dependent variable can
be explained by the independent variables: strength of the
relationship.
Regression analysis
• Determine the structure or form of the relationship: the
mathematical equation relating the independent and dependent
variables.
• Predict the values of the dependent variable.
• Control for other independent variables when evaluating the
contributions of a specific variable or set of variables.
• Regression analysis is concerned with the nature and degree of
association between variables and does not imply or assume any
causality.
Bivariate regression model
The basic regression equation is
Yi = β0 + β1Xi + ei,
where Yi = dependent or criterion variable,
Xi = independent or predictor variable,
β0 = intercept of the line,
β1 = slope of the line, and
ei = error term associated with the ith observation.
• Coefficient of determination
The strength of association is
measured by the coefficient of determination, r 2. It varies between
0 and 1 and signifies the proportion of the total variation in Y that is
accounted for by the variation in X.
• Estimated or predicted value
The estimated or predicted value of
Yi is Ŷi = a + bXi, where Ŷi is the predicted value of Yi, and a and b
are estimators of β0 and β1, respectively.
• Regression coefficient
The estimated parameter b is usually
referred to as the non-standardized regression coefficient.
• Scattergram
A scatter diagram, or scattergram, is a plot of the
values of two variables for all the cases or observations.
• Standard error of estimate
This statistic, SEE, is the standard
deviation of the actual Y values from the predicted values.
• Standard error
The standard deviation of b, SEb, is called the
standard error.
• Standardized regression coefficient
Also termed the beta
coefficient or beta weight, this is the slope obtained by the
regression of Y on X when the data are standardized.
• Sum of squared errors
The distances of all the points from the
regression line are squared and added together to arrive at the sum
of squared errors, Σej², which is a measure of total error.
• t statistic
A t statistic with n - 2 degrees of freedom can be used to
test the null hypothesis that no linear relationship exists between X
and Y, that is, H0: β1 = 0, where t = b/SEb.
Conducting Bivariate Regression Analysis
A scatter diagram, or scattergram, is a plot of the values of
two variables for all the cases or observations.
The most commonly used technique for fitting a straight
line to a scattergram is the least-squares procedure. In fitting the
line, the least-squares procedure minimizes the sum of squared
errors, Σej².
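The least-squares computations can be sketched as follows. The attitude/duration numbers below are hypothetical stand-ins, not the data set behind the plots later in this section:

```python
import math

# Least-squares fit of attitude (Y) on duration of residence (X).
# All data values are hypothetical, for illustration only.
x = [2, 4, 6, 8, 10, 12]   # duration of residence (years)
y = [3, 4, 5, 6, 8, 9]     # attitude score

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Slope b minimizes the sum of squared errors; intercept a follows.
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
b = sxy / sxx
a = my - b * mx

# Coefficient of determination r^2 and t statistic for H0: beta1 = 0.
ss_total = sum((yi - my) ** 2 for yi in y)
ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
r2 = 1 - ss_error / ss_total
t = b / math.sqrt((ss_error / (n - 2)) / sxx)   # n - 2 df

print(round(a, 3), round(b, 3), round(r2, 3))   # → 1.533 0.614 0.984
```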
Conducting Bivariate Regression Analysis
Plot the Scatter Diagram
Formulate the General Model
Estimate the Parameters
Estimate Standardized Regression Coefficients
Test for Significance
Determine the Strength and Significance of Association
Check Prediction Accuracy
Examine the Residuals
Cross-Validate the Model
Plot of Attitude with Duration
[Figure: scatter plot of Attitude (Y axis) versus Duration of
Residence (X axis, 2.25 to 18 years)]
Which Straight Line Is Best?
[Figure: the same scatter plot with four candidate straight lines
(Line 1 to Line 4) drawn through the points]
Bivariate Regression
[Figure: fitted regression line Ŷ = β0 + β1X, showing for each
observation Xj the observed value Yj, the predicted value Ŷj, and
the residual ej]
Assumptions
• The error term is normally distributed. For each fixed value of X, the
distribution of Y is normal.
• The means of all these normal distributions of Y, given X, lie on a straight
line with slope β1.
• The mean of the error term is 0.
• The variance of the error term is constant. This variance does not depend on
the values assumed by X.
• The error terms are uncorrelated. In other words, the observations have been
drawn independently.
Multiple Regression
Multiple Regression
The general form of the multiple regression model is as follows:
Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk + e
which is estimated by the following equation:
Ŷ = a + b1X1 + b2X2 + b3X3 + … + bkXk
As before, the coefficient a represents the intercept, but the b's
are now the partial regression coefficients.
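Estimating the partial regression coefficients amounts to solving the normal equations. The sketch below uses hypothetical data generated from y = 1 + 2·x1 + 3·x2, so the fit should recover those coefficients; the solver is a bare-bones Gauss-Jordan elimination, not a production routine:

```python
# Multiple regression sketch: solve the normal equations (X'X)b = X'y
# for intercept a, and partial coefficients b1, b2.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]                      # hypothetical predictor
y  = [1 + 2*u + 3*v for u, v in zip(x1, x2)]  # exact model, no noise

# Design matrix columns: intercept, x1, x2.
X = [[1.0, u, v] for u, v in zip(x1, x2)]

XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]

# Gauss-Jordan elimination: reduce XtX to the identity, carrying Xty along.
for i in range(3):
    piv = XtX[i][i]
    XtX[i] = [v / piv for v in XtX[i]]
    Xty[i] /= piv
    for r in range(3):
        if r != i:
            f = XtX[r][i]
            XtX[r] = [v - f * w for v, w in zip(XtX[r], XtX[i])]
            Xty[r] -= f * Xty[i]

a_hat, b1_hat, b2_hat = Xty
print(round(a_hat, 6), round(b1_hat, 6), round(b2_hat, 6))  # → 1.0 2.0 3.0
```

Because the data are noise-free, the recovered coefficients match the generating model exactly (up to floating-point error); with real data the same equations give the least-squares estimates.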
Statistics Associated with Multiple Regression
• Adjusted R2
R2, coefficient of multiple determination, is adjusted for the
number of independent variables and the sample size to account for the
diminishing returns. After the first few variables, the additional independent
variables do not make much contribution.
• Coefficient of multiple determination
The strength of association in multiple
regression is measured by the square of the multiple correlation coefficient,
R2, which is also called the coefficient of multiple determination.
• F test: The F test is used to test the null hypothesis that the coefficient of
multiple determination in the population, R2pop, is zero. This is equivalent
to testing the null hypothesis that all the partial regression coefficients are
zero: H0: β1 = β2 = β3 = … = βk = 0. The test statistic has an F distribution
with k and (n - k - 1) degrees of freedom.
• Partial F test: The significance of a partial regression coefficient, βi, of
Xi may be tested using an incremental F statistic. The incremental F
statistic is based on the increment in the explained sum of squares resulting
from the addition of the independent variable Xi to the regression equation
after all the other independent variables have been included.
• Partial regression coefficient: The partial regression coefficient, b1,
denotes the change in the predicted value, Ŷ, per unit change in X1 when
the other independent variables, X2 to Xk, are held constant.
Partial Regression Coefficients
To understand the meaning of a partial regression
coefficient, let us consider a case in which there are two
independent variables, so that:
Ŷ = a + b1X1 + b2X2
First, note that the relative magnitude of the partial
regression coefficient of an independent variable is, in general,
different from that of its bivariate regression coefficient.
Multicollinearity
• Multicollinearity arises when intercorrelations among the predictors are
very high.
Multicollinearity can result in several problems, including:
• The partial regression coefficients may not be estimated precisely.
The standard errors are likely to be high.
• The magnitudes, as well as the signs of the partial regression
coefficients, may change from sample to sample.
• It becomes difficult to assess the relative importance of the
independent variables in explaining the variation in the dependent
variable.
• Predictor variables may be incorrectly included or removed in
stepwise regression.
Multicollinearity
• A simple procedure for adjusting for multicollinearity consists of using only
one of the variables in a highly correlated set of variables.
• Alternatively, the set of independent variables can be transformed into a new
set of predictors that are mutually independent by using techniques such as
principal components analysis.
• More specialized techniques, such as ridge regression and latent root
regression, can also be used.
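A common diagnostic for the problem is the variance inflation factor (VIF); with just two predictors it reduces to 1/(1 - r²), where r is their correlation. The data below are hypothetical, chosen so the two predictors are nearly copies of each other:

```python
import math

# Variance inflation factor for two highly correlated predictors.
def pearson_r(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u)
                    * sum((b - mv) ** 2 for b in v))
    return num / den

x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0]   # nearly a copy of x1 (hypothetical)

r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)
print(round(r, 3), round(vif, 1))     # a VIF well above 10 signals trouble
```

Dropping one of the two variables, or replacing them with a principal component, drives the VIF back toward 1.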