Statistics Course
INTRODUCTION TO STATISTICS
What does the word statistics mean? To most people, it suggests numerical facts or data, such as
unemployment figures, farm prices, or the number of marriages and divorces. The most common
definitions of the word statistics are as follows:
Statistics is the science of planning studies and experiments, obtaining data, and then
organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions
based on the data (Triola, 2012).
Statistics is facts or data, either numerical or qualitative, organized and summarized so as
to provide useful and accessible information about a particular subject (Weiss, 2012).
Statistics is the science of organizing, summarizing, and analyzing numerical information
in order to make decisions (Weiss, 2012).
Statistics is the science of collecting, organizing, analyzing, and interpreting data in order
to make decisions.
A statistic is some piece of information that is presented in numerical form; the field focuses on
appropriate ways to collect, codify, analyze, and interpret numerical information (Dunn,
2001). For example,
A nation's 5% unemployment rate
A candidate receiving 61% of the popular vote
Academic achievement and status of students
Attrition and dropout rates of students in Woldia University across colleges
and departments
The budget allocated to Woldia University from 2004 to 2010 E.C
Studying statistics helps you:
to be an intelligent consumer of statistical information
to write up analyses and results in American Psychological Association (APA) style
Statistics has two branches: descriptive statistics and inferential statistics.
Inferential statistics
Consists of methods for drawing and measuring the reliability of conclusions about a
population based on information obtained from a sample of the population.
Permits generalizations to be made about populations based on sample data drawn from
them.
Uses statistics, which are measures of a sample, to infer values of parameters, which are
measures of a population.
Is the branch of statistics that involves using a sample to draw conclusions about a
population.
Inferential statistics include the t-test, correlation, ANOVA, MANOVA, regression, and factor
analysis, which use sample data and generalize the findings to the population.
Descriptive statistics
Statistical procedures that describe, organize, and summarize the main characteristics of
sample data.
Simply describe the set of data at hand.
Is the branch of statistics that involves the organization, summarization, and display of
data.
Descriptive statistics use ratios, percentages, means, tables, graphs, figures, charts, standard
deviations, diagrams, and ranges.
Practical Example 1 - Decide which part of each study represents descriptive
statistics. What conclusions might be drawn from the study using inferential
statistics?
1. A large sample of men, aged 48, was studied for 18 years. For unmarried
men, approximately 70% were alive at age 65. For married men, 90% were
alive at age 65.
2. A survey conducted among 1017 men and women by Opinion
Research Corporation International found that 76% of women and 60% of
men had a physical examination within the previous year.
Solution for question 1
Descriptive statistics involves statements such as
“For unmarried men, approximately 70% were alive at age 65”
“For married men, 90% were alive at 65.”
Solution for question 2 - An inference drawn from the study is that a higher percentage of
women than men had a physical examination within the previous year.
Data
Data Sets
There are two types of data sets you will use when studying statistics: populations and
samples.
Population
The complete collection of all individuals (scores, people, measurements, and so on) to be
studied. The collection is complete in the sense that it includes all of the individuals to be
studied
The collection of all individuals or items under consideration in a statistical study
Complete set of events in which you are interested.
is the collection of all outcomes, responses, measurements, or counts that are of interest
For instance
if we were interested in the stress levels of all adolescent Americans, then the
collection of all adolescent Americans' stress scores would form a population,
the scores of all morphine-injected mice
the milk production of all cows in the country
The ages at which every girl first began to walk
the stress scores of the sophomore class in Woldia University
The population can range from a relatively small set of numbers, which is easily collected, to an
infinitely large set of numbers, which can never be collected completely. The populations in
which we are interested are usually quite large, so collecting complete data can be difficult;
researchers therefore collect data from a representative sample taken from the
population.
Census - a study that obtains data from every member of the population.
Sample - a subset, or part, of the population from which data are actually collected.
Practical example 2: In a recent survey, 1500 adults in the United States were asked if they
thought there was solid evidence of global warming. Eight hundred fifty-five of the adults said yes.
Solution
The population consists of the responses of all adults in the United States
The sample consists of the responses of the 1500 adults in the United States in
the survey.
Parameter - a numerical measure that describes a characteristic of a population.
Statistic - a numerical measure that describes a characteristic of a sample.
N.B -It is important to note that a sample statistic can differ from sample to sample
whereas a population parameter is constant for a population.
Practical example 3: Decide whether the numerical value describes a population parameter
or a sample statistic.
1. A recent survey of 200 college career centers reported that the average starting salary
for petroleum engineering majors is $83,121.
2. The 2182 students who accepted admission offers to Northwestern University in 2009
have an average SAT score of 1442
3. In a random check of a sample of retail stores, the Food and Drug Administration found
that 34% of the stores were not storing fish at the proper temperature.
Solution
1. The average of $83,121 is based on a subset of the population, so it is a sample statistic.
2. The SAT score of 1442 is based on all the students who accepted admission offers in 2009, so
it is a population parameter.
3. The percentage of 34% is based on a subset of the population, so it is a sample statistic.
When doing a study, it is important to know the kind of data involved. The nature of the data you
are working with will determine which statistical procedures can be used. Data sets can consist of
two types of data: qualitative data and quantitative data.
Qualitative data
Practical Example 4 - As shown in the table below, which data are qualitative and which
are quantitative?
A variable is any factor that can be measured or can take on different values. Such factors can
vary from person to person, place to place, or experimental situation to experimental situation.
A variable is anything that can take on different values.
For example:
1. Discrete variable
Data are expressed in numbers whose possible values form a finite or "countable" set:
0, 1, 2, 3, 4, 5, and so on.
A quantitative variable whose possible values are counting numbers but not fractional
numbers.
A "discrete" variable is used to characterize data in terms of whole numbers (1, 2, 3, and so
on) with no fractional counts occurring between them.
For example, the number of students in a class or the number of children in a family.
2. Qualitative variable
Qualitative variables are those variables which differ in kind rather than degree. These
can be measured on nominal or ordinal scales.
For example
Gender - females and males
Political parties – liberals, democratic, republican and so on
Grade levels – grade 1, grade 2 or 1st year, 2nd year, 3rd year
Economic status - destitute, poor, rich, wealthy
Academic status – warning, probation, promoted
Colleges – Education, FBE, Technology, Agriculture ……
1.4 Scales / levels/ of measurement
Measurement represents a set of rules informing us how values are assigned to objects or
events. Stevens (1946) identified four scales in his theory: nominal, ordinal, interval, and ratio,
in that order. Each scale includes an extra feature or rule beyond those in the one before it.
We will add a fifth scale to Stevens's treatment, summative response scaling, placing it between
the ordinal and the interval scale.
1. Nominal Scales
An observation is simply given a name, a label, or is otherwise classified.
Nominal scales use numbers, but these numbers are not in any mathematical relationship
with one another.
A nominal scale uses numbers to identify qualitative differences among measurements.
The measurements made by a nominal scale are names, labels, or categories, and no
quantitative distinctions can be drawn among them.
More qualitative and provide less information.
Nominal scales are the lowest level of measurement.
Categorical variables represent different categories.
Shows membership in a category.
The data are organized in the form of frequency counts for a given category.
Frequency counts simply tell us how many people we have in each category.
For example - gender (1 = male, 2 = female), ethnicity or religion of a person, smoker vs.
nonsmoker, literate vs. illiterate.
2. Ordinal scales
The measurement of an observation involves ranking or ordering based on an underlying
dimension.
An ordinal scale ranks or orders observations based on whether they are greater than or
less than one another.
Ordinal scales do not provide information about how close or distant observations are
from one another.
An ordinal scale of measurement uses numbers to convey "less than" and "more than"
information. This most commonly translates as rank ordering. Objects may be ranked in
the order that they align themselves on some quantitative dimension, but it is not possible
from the ranking information to determine how far apart they are on the underlying
dimension.
3. Interval scales
Interval scales of measurement have all of the properties of nominal, ordinal, and
summative response scales.
The most common illustrations of an equal-interval scale are the Fahrenheit and Celsius
temperature scales.
According to Stevens, "Equal intervals of temperature are sealed off by noting equal
volumes of expansion." Essentially, the difference in temperature between 30°F and 40°F
is equal to the difference between 70°F and 80°F.
A less obvious but important characteristic of interval scales is that they have
arbitrary zero points.
For example, zero degrees does not mean the absence of temperature - on the
Celsius scale, zero degrees is the temperature at which water freezes.
As was true for summative response scales, it is meaningful to average data collected on
an interval scale of measurement, e.g., "The average high temperature in our home town
last week was 51.4°F."
4. Ratio scales
Common examples of ratio scales are time and measures of distance.
We can interpret ratios of the numbers on these scales in a meaningful way:
four hours is twice as long as two hours, and three miles is half the distance of six
miles.
CHAPTER TWO
ORGANIZING AND PRESENTING DATA
2.1 Raw Data
Raw data are primary or secondary data (e.g., numbers, instrument readings, figures)
collected from a source. Raw data have not yet been organized, summarized, or otherwise
processed.
Step 1 - List the distinct values of the observations in the data set in the first column of a table.
Step 2 - For each observation, place a tally mark in the second column of the table in the row of
the appropriate distinct value.
Step 3 - Count the tallies for each distinct value and record the totals in the third column of the
table.
What is the highest level of education you have completed (please tick)? The responses of the 40
participants in the study are given in the table below. Determine a frequency distribution of these
data.
❐ 1. Illiterate          ❐ 4. Technique/College
❐ 2. Primary school      ❐ 5. Undergraduate university
❐ 3. Secondary school    ❐ 6. Postgraduate
Solution
Step 1 - List the distinct values of the observations in the data set in the first column of a table.
Step 2 - For each observation, place a tally mark in the second column of the table in the row of
the appropriate distinct value.
Step 3 - Count the tallies for each distinct value and record the totals in the third column of the
table. Counting the tallies in the second column gives the frequencies in the third column.
The first and third columns then provide a frequency distribution for the data.

Category            Tally           Frequency
Illiterate          ////            4
Primary             //// //// //    12
Secondary           //// ///        8
Technique/College   //// //         7
Undergraduate       //// /          6
Postgraduate        ///             3
Total                               40
In addition to the frequency with which a particular distinct value occurs, we are often interested
in the relative frequency, which is the ratio of the frequency to the total number of observations.
Step 1 - Obtain a frequency distribution of the data. We obtained a frequency distribution of the
data in the table above.
Relative frequency = Frequency / Number of total observations

Relative frequency of a category = Frequency of category / Number of total observations

Relative F for illiterate = Frequency of illiterate / Number of total observations = 4/40 = 0.1

Relative F for primary = Frequency of primary / Number of total observations = 12/40 = 0.3

Relative F for secondary = Frequency of secondary / Number of total observations = 8/40 = 0.2
What is the highest level of education you have completed?

Category            Frequency   Relative frequency (%)
Illiterate          4           10
Primary             12          30
Secondary           8           20
Technique/College   7           17.5
Undergraduate       6           15
Postgraduate        3           7.5
Total               40          100
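The tallying and relative-frequency steps above can be sketched in Python. The raw responses below are reconstructed so the counts match the table; they are not the original survey records.

```python
from collections import Counter

# Hypothetical raw responses, reconstructed so the counts match the table above
responses = (["Illiterate"] * 4 + ["Primary"] * 12 + ["Secondary"] * 8 +
             ["Technique/College"] * 7 + ["Undergraduate"] * 6 +
             ["Postgraduate"] * 3)

freq = Counter(responses)          # Steps 1-3: tally each distinct value
n = len(responses)                 # total number of observations (40)
rel_freq = {cat: f / n for cat, f in freq.items()}

print(freq["Primary"], rel_freq["Primary"])   # 12 0.3
```

Multiplying each relative frequency by 100 gives the percentage column of the table.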
A bar graph is a graph that displays the frequency or numerical distribution of a categorical
variable, showing values for each bar next to each other for easy comparison. A bar chart is a
graphical display of data that have been classified into a number of categories. Equal-width
rectangular bars are used to represent each category, with the heights of the bars being
proportional to the observed frequency in the corresponding category.
2. Bars can be vertical or horizontal.
4. The y-axis represents the quantitative values of the variable being displayed.
7. Heights of bars represent the values of the variable displayed - the frequency of occurrence or
percentage of occurrence.
8. The graph is well annotated with a title, labels for each bar, a vertical scale, horizontal
categories, and the source.
[Bar graph: "What is the highest level of education you have completed?" - Illiterate 10%,
Primary 30%, Secondary 20%, Technique/College 18%, Undergraduate 15%, Postgraduate 8%]
Pie Chart
A pie chart is a disk divided into wedge-shaped pieces proportional to the relative
frequencies of the qualitative data
Another method for organizing and summarizing data is to draw a picture of some kind. The old
saying “a picture is worth a thousand words” has particular relevance in statistics—a graph or
chart of a data set often provides the simplest and most efficient display. Two common methods
for graphically displaying qualitative data are pie charts and bar charts. We begin with pie charts.
[Pie chart: "What is the highest level of education you have completed?" - Illiterate 10%,
Primary 30%, Secondary 20%, Technique/College 18%, Undergraduate 15%, Postgraduate 8%]
Bar Graphs are easier to make & to read than pie charts
Both pie charts & bar graphs can display the distribution of a categorical variable
A bar graph can also compare any set of quantities measured in the same units
Organizing Quantitative Data using frequency distribution
To organize quantitative data, we first group the observations into classes. Consequently, once
we group the quantitative data into classes, we can construct frequency and relative-frequency
distributions of the data in exactly the same way as we did for qualitative data. Several methods
can be used to group quantitative data into classes. Here we discuss two of the most common
methods: single-value grouping and limit grouping
Single-Value Grouping
In some cases, the most appropriate way to group quantitative data is to use classes in which
each class represents a single possible value. Such classes are called single-value classes, and this
method of grouping quantitative data is called single-value grouping.
Table 7: Test scores taken from first year students in statistics class
A second way to group quantitative data is to use class limits. With this method, each class
consists of a range of values. The smallest value that could go in a class is called the lower limit
of the class, and the largest value that could go in the class is called the upper limit of the class.
This method of grouping quantitative data is called limit grouping. It is particularly useful when
the data are expressed as whole numbers and there are too many distinct values to employ
single-value grouping.
Class width: The difference between the lower limit of a class and the lower limit of the
next-higher class.
Midpoint: The average of the two class limits of a class.
Table 10: Grouped data frequency distribution
Class width (i) = Range / Number of intervals = 36/5 = 7.2 (round up to 8)
To set the lower and upper boundaries, 0.5 is subtracted from the lower limit and added to the
upper limit of each class interval. The class boundaries of the distribution are therefore
organized as follows:
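The width and boundary rules above can be sketched in Python. The class limits listed are illustrative values, not taken from a specific table here.

```python
import math

# Class width: range divided by the number of intervals, rounded up
data_range = 36
num_intervals = 5
width = math.ceil(data_range / num_intervals)   # 36/5 = 7.2 -> 8
print(width)   # 8

# Class boundaries: subtract 0.5 from each lower limit, add 0.5 to each upper limit
limits = [(18, 25), (26, 33), (34, 41)]         # illustrative class limits
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]
print(boundaries[0])   # (17.5, 25.5)
```

The 0.5 adjustment assumes whole-number data, so adjacent class boundaries meet without gaps.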
UNIT THREE
MEASURES OF CENTRAL TENDENCY
Central tendency is a statistical measure that determines a single value that accurately
describes the center of the distribution and represents the entire distribution of scores.
The goal of central tendency is to identify the single value that is the best representative
for the entire set of data.
A measure of central tendency is a single value representing a group of values and hence is
supposed to have the following properties.
A good measure of central tendency must be easy to comprehend, and the procedure involved in
its calculation should be simple.
3. Rigidly defined
A measure of central tendency must be clearly and properly defined. It is better if it is
algebraically defined so that personal bias can be avoided in its calculation.
A good average should not be unduly affected by extreme or extraordinary values in a series.
It is capable of further algebraic treatment
Mean is the center in balancing the values on either side of it and hence is more typical
The mean is sensitive to the exact value of all the scores in the distribution
The sum of the deviations about the mean equals zero
3.1.2 Computing Means of Ungrouped Data

x̄ = (sum of all x) / (number of x) = Σx / n
Example: The following data represent the ages of 20 students in a statistics class. Calculate the
mean age of the students.
20 20 20 20 20 20 21
21 21 21 22 22 22 23
23 23 23 24 24 65
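The mean of these ungrouped ages can be computed directly. Note how the single extreme value (65) pulls the mean above almost every score, illustrating the mean's sensitivity to extreme values mentioned above.

```python
# Mean of ungrouped data: x-bar = (sum of all x) / (number of x)
ages = [20, 20, 20, 20, 20, 20, 21,
        21, 21, 21, 22, 22, 22, 23,
        23, 23, 23, 24, 24, 65]

mean_age = sum(ages) / len(ages)   # 475 / 20
print(mean_age)                    # 23.75
```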
Step 1 - Prepare the class intervals or boundaries.
Step 2 - Find the midpoint of each class.
Step 3 - Find the sum of the products of the midpoints and the frequencies.
Class interval   Frequency   Cumulative frequency   Midpoint (x)   x·f
18 – 25          13          13                     21.5           279.5
26 – 33          8           21                     29.5           236
34 – 41          4           25                     37.5           150
42 – 49          3           28                     45.5           136.5
50 – 57          2           30                     53.5           107
                 N = 30                                            Σ(x·f) = 909
Then, Mean = Σ(f·x) / N = 909/30 = 30.3
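The grouped-mean calculation above can be sketched as:

```python
# Grouped mean: mean = sum(f * x) / N, where x is each class midpoint
intervals = [(18, 25), (26, 33), (34, 41), (42, 49), (50, 57)]
freqs = [13, 8, 4, 3, 2]

midpoints = [(lo + hi) / 2 for lo, hi in intervals]   # 21.5, 29.5, ...
N = sum(freqs)                                        # 30
mean = sum(f * x for f, x in zip(freqs, midpoints)) / N
print(mean)   # 30.3
```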
     Class interval   Frequency (f)   Cumulative frequency   f·x    Midpoint (x)
1    9.5 – 14.5       1               1                      12     12
2    14.5 – 19.5      1               2                      17     17
3    19.5 – 24.5      2               4                      44     22
4    24.5 – 29.5      7               11                     189    27
5    29.5 – 34.5      3               14                     96     32
6    34.5 – 39.5      2               16                     74     37
7    39.5 – 44.5      4               20                     168    42

Mean = Σ(f·x) / N = 600/20 = 30
The median is the point in a data set above and below which half of the cases fall.
The median of a data set is the measure of center that is the middle value when the
original data values are arranged in order of increasing (or decreasing) magnitude.
The median is the middle score of a data set when the scores are ordered from the
smallest to the largest.
The median is a number or score that precisely divides a distribution of data in half:
fifty percent of a distribution's observations fall above the median and fifty percent
fall below it.
The middle number in an ordered set of numbers; it divides the data into two equal
parts.
The median can be used for calculations involving ordinal-, interval-, or ratio-scale
data.
It can be difficult to compute because the data must be sorted.
It is the best average for ordinal data.
It is unaffected by extreme data.
If a data set is odd in number, the median falls exactly on the middle number.
If a data set is even in number: the median is the average of the two middle values.
Example: Find the median of the scores 26 32 21 12 15 11 27 16 18 21 19 28 10 13 31
Step 1: To calculate the median, arrange the scores from the lowest to the highest:
10 11 12 13 15 16 18 19 21 21 26 27 28 31 32
Step 2: The location of the median can be found by taking the middle value or using a
simple formula: position of median = (N + 1)/2 = (15 + 1)/2 = 8
The median is therefore the 8th score, which is 19.
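The two steps above translate directly into code:

```python
# Median of ungrouped data: sort, then take the (N + 1)/2-th score
scores = [26, 32, 21, 12, 15, 11, 27, 16, 18, 21, 19, 28, 10, 13, 31]

ordered = sorted(scores)             # Step 1: arrange from lowest to highest
position = (len(ordered) + 1) // 2   # Step 2: (15 + 1)/2 = 8th score
median = ordered[position - 1]       # list indexing starts at 0
print(median)   # 19
```

For an even number of scores, the median would instead be the average of the two middle values, as noted above.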
Based on the following frequency distribution, answer the questions given below the data.
Questions
There are steps for the calculation of the median in a frequency distribution:
Step 2: Find n/2 to identify the median class.
Step 3: Find in the cumulative frequency column the first value greater than n/2; the
corresponding class interval is called the median class.
Step 4: Calculate the median of the distribution.

Median class: n/2 = 20/2 = 10

Median = L + ((n/2 − m) / f) × c
Where: n = the total number of scores
L = the lower boundary of the median class
m = the cumulative frequency before the median class
f = the frequency of the median class
c = the class width
The median position lies between the cumulative frequencies 4 and 11. Corresponding to 4 the
"less than" boundary is 24.5, and corresponding to 11 it is 29.5. Therefore the median class is
24.5–29.5, and its lower boundary is 24.5.
Here L = 24.5, n = 20, f = 7, c = 5, m = 4
Median = 24.5 + ((10 − 4)/7) × 5 = 24.5 + (6/7) × 5 = 24.5 + 4.29 = 28.79
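The grouped-median formula can be checked in a few lines:

```python
# Median for grouped data: Median = L + ((n/2 - m) / f) * c
L = 24.5   # lower boundary of the median class (24.5 - 29.5)
n = 20     # total number of scores
m = 4      # cumulative frequency before the median class
f = 7      # frequency of the median class
c = 5      # class width

median = L + ((n / 2 - m) / f) * c
print(round(median, 2))   # 28.79
```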
3.1.1. Properties of mode
Based on the following frequency distribution, answer the questions given below the data.
Questions
The modal class can be identified more easily than the median class: it is the class with the
highest frequency in the distribution. Here, the modal class is 24.5 to 29.5.
Mode = 24.5 + (3/(3 + 2)) × 5 = 24.5 + (3/5) × 5 = 24.5 + 3 = 27.5
Class work
10 M 2 M 8 M 7 F
2 F 8 F 9 F 6 M
Based on the table above, answer the following questions.
UNIT FOUR
MEASURES OF VARIABILITY
Measures of variability provide information about the amount of spread or dispersion among the
variables. Range, variance, and standard deviation are the common measures of variability.
Range
Simply the difference between the largest and smallest values in a set of data.
Considered primitive, as it uses only the extreme values, which may not be useful
indicators of the bulk of the population.
Can be used for ordinal data.
The formula is: Range = largest observation − smallest observation
Standard deviation
Measures the 'average deviation' of observations from the mean.
Used on ratio or interval data.
The standard deviation measures the variation among data values.
Values close together have a small standard deviation, but values with much more
variation have a larger standard deviation.
For many data sets, a value is unusual if it differs from the mean by more than two
standard deviations.
For example - the following are assessment scores of students in Abnormal Psychology.
Calculate the variance and standard deviation of the data set.
Sum of squares: Σ(x − x̄)² = 88.5

Sample variance: s² = Σ(x − x̄)² / (n − 1) = 88.5 / (10 − 1) = 9.83

SD = √9.83 = 3.135
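The variance and standard deviation calculation can be sketched from the quantities given above (the raw assessment scores themselves are not reproduced here, only their sum of squared deviations):

```python
import math

# Sample variance from the worked example: s^2 = SS / (n - 1),
# where SS = sum of squared deviations from the mean = 88.5 and n = 10
sum_of_squares = 88.5
n = 10

variance = sum_of_squares / (n - 1)       # 88.5 / 9
sd = math.sqrt(variance)
print(round(variance, 2), round(sd, 3))   # 9.83 3.136
```

(The text reports √9.83 as 3.135, truncating rather than rounding the final digit.)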
Variance
Is the sum of the squared deviations of each value from the mean, divided by the number
of observations.
The mean of squared differences between scores and the mean.
Used on ratio or interval data.
Used for advanced statistical analysis.
Is equal to the average of the squared deviations from the mean of a distribution.
Symbolically, sample variance is s² and population variance is σ².
Classwork - Test
Measures of position tell where a specific data value falls within the data set or its relative
position in comparison with other data values.
Interquartile Range
IQR = Q3 − Q1
Find a quartile by determining the value in the appropriate position in the ranked data:
Q1 is at position (n + 1)/4, Q2 at position (n + 1)/2, and Q3 at position 3(n + 1)/4.
For example: Sample ordered data (n = 9): 11 12 13 16 16 17 18 21 22
Q1 is in the (9 + 1)/4 = 2.5 position of the ranked data, so Q1 = (12 + 13)/2 = 12.5
Q2 is in the (9 + 1)/2 = 5th position of the ranked data, so Q2 = median = 16
Q3 is in the 3(9 + 1)/4 = 7.5 position of the ranked data, so Q3 = (18 + 21)/2 = 19.5
Then, interquartile range = Q3 − Q1 = 19.5 − 12.5 = 7
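The position method used in this example can be sketched as a small helper function:

```python
# Quartiles by the position method above: Qk lies at position k(n + 1)/4
# in the ordered data; a halfway position averages the two neighbouring values.
def quartile(ordered, k):
    pos = k * (len(ordered) + 1) / 4           # 1-based position in the ranked data
    i = int(pos)                               # whole part of the position
    if pos == i:                               # whole-number position: take that value
        return ordered[i - 1]
    return (ordered[i - 1] + ordered[i]) / 2   # average the neighbours

data = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])   # n = 9
q1, q2, q3 = quartile(data, 1), quartile(data, 2), quartile(data, 3)
print(q1, q2, q3, q3 - q1)   # 12.5 16 19.5 7.0
```

Note that statistical software often uses different interpolation conventions for fractional positions, so results can differ slightly from this textbook method.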
For example, the 50th percentile, denoted P50, has about 50% of the data values below it and
about 50% of the data values above it, so the 50th percentile is the same as the median. There is
no universal agreement on a single procedure for calculating percentiles, but we will describe
two relatively simple procedures:
(1) Finding the percentile of a data value:

Percentile of x = (number of data values less than x / n) × 100
(2) Finding the data value corresponding to a given percentile:
L = the locator that gives the position of the value: L = (k/100) × N
For example, for the 25th percentile, L = 25/100 × N; here L = 3.25, which is rounded up to the
4th position, and the 4th value is 6. This shows that 25% of students scored 6 and below.
Find the value of the 50th percentile.
4.3.5. Z-score
Z-scores are merely scores expressed in terms of the number of standard statistical
units of measurement (standard deviations) they are from the mean of the set of scores.
A z-score (or standardized value) is found by converting a value to a standardized
scale, as given in the following definition. This definition shows that a z-score is the
number of standard deviations that a data value is deviated from the mean.
A z-score (or standardized value) is the number of standard deviations that a given
value x is above or below the mean.
We used the range rule of thumb to conclude that a value is "unusual" if it is more
than 2 standard deviations away from the mean. It follows that unusual values have
z-scores less than −2 or greater than +2.
A positive z-score means that a score is above the mean.
A negative z-score means that a score is below the mean.
A z-score of 0 means that a score is exactly the same as the mean.
For example
A student scored a 65 on a math test that had a mean of 50 and a standard deviation of 10. She
scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative
position on the two tests.
Solution
Math: z = (65 − 50)/10 = 15/10 = 1.5
History: z = (30 − 25)/5 = 5/5 = 1.0
The student did better in math because her z-score there was higher.
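The comparison can be sketched as:

```python
# z = (x - mean) / sd converts raw scores to a common standardized scale,
# making scores from different tests directly comparable
def z_score(x, mean, sd):
    return (x - mean) / sd

math_z = z_score(65, 50, 10)     # (65 - 50) / 10
history_z = z_score(30, 25, 5)   # (30 - 25) / 5
print(math_z, history_z)         # 1.5 1.0
```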
Example 2
Find the z-score for each test and state which test is better
Test A:
Test B:
CHAPTER FIVE
MEASURES OF RELATIONSHIP
The correlation coefficient, r, measures the strength of the linear relationship between two paired
variables in a sample. A Pearson or Spearman correlation is used when you want to explore the
strength of the relationship between two continuous variables. This gives you an indication of
both the direction (positive or negative) and the strength of the relationship. A positive
correlation indicates that as one variable increases, so does the other; a negative correlation
indicates that as one variable increases, the other decreases.
Different authors suggest different interpretations; however, Cohen (1988, pp. 79–81) suggests
the following guidelines:

Category   Positive             Negative
Small      r = 0.10 to 0.29     r = −0.10 to −0.29
Medium     r = 0.30 to 0.49     r = −0.30 to −0.49
Large      r = 0.50 to 1.0      r = −0.50 to −1.0
The Pearson r is used to advance research beyond the arena of descriptive statistics. Specifically,
the Pearson r enables investigators to assess the nature of the association between two
variables, X and Y.
The Pearson r, a correlation coefficient, is a statistic that quantifies the extent to which two
variables X and Y are associated, and whether the direction of their association is positive,
negative, or zero.
A positive correlation is one where, as the value of X increases, the corresponding value of Y also
increases; similarly, a positive correlation exists when, as the value of X decreases, the value of Y
also decreases.
A negative correlation identifies an inverse relationship between variables X and Y: as the value
of one increases, the other decreases.
A zero correlation indicates that there is no pattern or predictive linear relationship between the
behavior of variables X and Y.
Each participant should have two measurements.
The number of participants should be greater than 30.
The distribution should be symmetric or normal.
Identify the type of correlation from the above scatter plot.
For example:

Absenteeism (X)   Academic Achievement (Y)   XY    X²    Y²
0                 8                          0     0     64
2                 10                         20    4     100
3                 4                          12    9     16
6                 6                          36    36    36
9                 1                          9     81    1
10                3                          30    100   9
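Computing r for the six pairs in this table from the raw-score formula reproduces the value of −.797 used in the interpretation:

```python
import math

# Pearson r from the raw-score formula:
# r = (n*SXY - SX*SY) / sqrt((n*SXX - SX^2) * (n*SYY - SY^2))
X = [0, 2, 3, 6, 9, 10]    # absenteeism
Y = [8, 10, 4, 6, 1, 3]    # academic achievement

n = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))   # 107
sxx = sum(x * x for x in X)              # 230
syy = sum(y * y for y in Y)              # 226

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 3))   # -0.797
```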
r = -.797
Correlations
                                    Absenteeism   GPA
Absenteeism   Pearson Correlation   1             −.797
              Sig. (2-tailed)                     0.00
              N                     300           300
GPA           Pearson Correlation   −.797         1
              Sig. (2-tailed)       0.00
              N                     300           300
There was a strong, negative correlation between the two variables, r = −.797, n = 300, p < .0005,
with high levels of absenteeism associated with lower levels of GPA. This implies that the
relationship is negative and significant: as absenteeism increases, GPA decreases.
Class work
Students 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Test1    6 6 5 4 7 4 4 3 6 10 6  6  4  8  12 12 11
Test2    8 4 8 2 4 8 2 5 10 10 10 8  7  12 11 10 9
Calculate the relationship between Test 1 and Test 2, check its significance and interpret it
Spearman's rho is used when:
- there are ranked data for variable A and variable B
- the data are skewed away from the normal distribution
- N is less than 30
The Pearson correlation coefficient is the dominant correlation index in psychological statistics.
There is another called Spearman’s rho which is not very different. Instead of taking the scores
directly from your data, the scores on a variable are ranked from smallest to largest. That is, the
smallest score on variable X is given rank 1, the second smallest score on variable X is given
rank 2, and so forth. The smallest score on variable Y is given rank 1, the second smallest score
on variable Y is given rank 2, etc. Then Spearman’s rho is calculated like the Pearson correlation
coefficient between the two sets of ranks as if the ranks were scores.
A special procedure is used to deal with tied ranks. Sometimes certain scores on a variable are
identical; there might be two or three people who scored 7 on variable X, for example. This
situation is described as tied scores or tied ranks. The question is what to do about them. The
conventional answer in psychological statistics is to pretend first of all that the tied scores can be
separated by fractional amounts. Then we allocate the appropriate ranks to these 'separated'
scores, but give each of the tied scores the average rank that they would have received if they
could have been separated.
The two scores of 5 are each given the rank 2.5 because if they were slightly different they
would have been given ranks 2 and 3, respectively. But they cannot be separated and so we
average the ranks as follows:
In the table below, there are two scores of 5; these tied scores are each given the average rank
(2 + 3)/2 = 2.5.
(7 + 8 + 9)/3 = 8
There are three scores of 9 which would have been allocated the ranks 7, 8 and 9 if the scores
had been slightly different from each other. These three ranks are averaged to give an average
rank of 8 which is entered as the rank for each of the three tied scores
Participants     1     2      3     4     5      6     7     8     9      10
Test1 for MA     8     3      9     7     2      3     9     8     6      7
Rank1            7.5   2.5    9.5   5.5   1      2.5   9.5   7.5   4      5.5
Test2 for MUA    2     6      4     5     7      7     2     3     5      4
Rank2            1.5   8      4.5   6.5   9.5    9.5   1.5   3     6.5    4.5
Difference (D)   6     5.5    5     1     8.5    7     8     4.5   2.5    1
D²               36    30.25  25    1     72.25  49    64    20.25 6.25   1
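The ranking with ties and the difference formula can be sketched as below. With ties present, the D² shortcut is only an approximation to computing Pearson's r on the ranks, but it matches the procedure shown in this table.

```python
# Spearman's rho via the difference formula: rho = 1 - 6*sum(D^2) / (n(n^2 - 1)),
# with tied scores receiving the average of the ranks they span
def ranks_with_ties(scores):
    ordered = sorted(scores)
    ranks = []
    for s in scores:
        # 1-based positions this score occupies in the ordered list
        positions = [i + 1 for i, v in enumerate(ordered) if v == s]
        ranks.append(sum(positions) / len(positions))   # average rank for ties
    return ranks

math_ability = [8, 3, 9, 7, 2, 3, 9, 8, 6, 7]      # Test1 for MA
musical_ability = [2, 6, 4, 5, 7, 7, 2, 3, 5, 4]   # Test2 for MUA

r1 = ranks_with_ties(math_ability)
r2 = ranks_with_ties(musical_ability)
n = len(r1)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))     # sum of squared rank differences
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rho, 2))   # 305.0 -0.85
```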
This finding implies that the Spearman correlation coefficient found evidence that musical ability
was significantly and inversely related to mathematical ability (r = −0.85, p < 0.05).
Classwork exercise
1. A researcher wants to investigate the relationship between hours of study and levels of
perceived stress. Data were collected from a sample of 10 students, as shown below.
2. Is there a relationship between students' study hours and their levels of perceived
stress?
3. Do people who study more hours have lower or higher levels of perceived
stress?
CHAPTER SIX
HYPOTHESIS TESTING
Many experiments are carried out with the deliberate object of testing hypotheses.
For example, consider the following statement:
A. "Students who receive counseling will show a greater increase in creativity than
students not receiving counseling."
Typically, in hypothesis testing, we have two options to choose from. These are termed the null
hypothesis and the alternative hypothesis.
6.4 Directional & Non-directional Hypotheses
Directional hypotheses
A directional hypothesis specifies the exact direction of the relationship between the two variables: as study hours increase, so will exam grades. This is also called a one-tailed hypothesis.
Non-directional hypotheses
In some studies we are not sure of the exact nature of the relationship. In making such a prediction we expect there to be a relationship, but we are not sure whether, for example, anxiety will increase or decrease memory. We therefore predict only that there is a relationship between the two variables, without specifying its exact nature. This is called a two-tailed hypothesis.
6.5 Errors
Type I error
Suppose we conducted some research and found that the probability of obtaining the effect we observed by chance is small.
In one study, the null hypothesis stated that there is no relationship between length of hair in males and the number of criminal offences committed.
But we have made a Type I error if we conclude that we have support for our prediction that there will be a relationship between length of hair in males and the number of criminal offences committed, when in fact the null hypothesis is true.
• A Type I error occurs when the sample data appear to show a treatment effect when, in
fact, there is none.
• In this case the researcher will reject the null hypothesis and falsely conclude that the
treatment has an effect.
• Type I errors are caused by unusual, unrepresentative samples. Just by chance the
researcher selects an extreme sample with the result that the sample falls in the critical
region even though the treatment has no effect.
• The hypothesis test is structured so that Type I errors are very unlikely; specifically, the
probability of a Type I error is equal to the alpha level.
Type II Errors
• A Type II error occurs when the sample data fail to show a treatment effect that actually exists.
• In this case, the researcher will fail to reject the null hypothesis and falsely conclude that
the treatment does not have an effect.
• Type II errors are commonly the result of a very small treatment effect. Although the
treatment does have an effect, it is not large enough to show up in the research study.
A big difference in mean scores between conditions may be due to the predicted effects
of the independent variable rather than to random variability. But there is always a specific
probability that the differences in scores are caused entirely by random variability. So there
can never be 100 per cent certainty that the scores in an experiment are due to the effects
of manipulating the independent variable.
Statistical tests calculate probabilities that results are significant. Statistical tables provide
probabilities that any differences in scores are due to random variability, as stated by the
null hypothesis. This means that the less probable it is that any differences are due to
random variability, the more justification there is for rejecting the null hypothesis. This is
the basis of all statistical tests. Statistical tables give the probability that scores in an
experiment occur on a random basis.
If the probability that the scores are random is very low, then you can reject the null
hypothesis that the differences are random. Instead you can accept the research
hypothesis that the experimental results are significant, that is, that they are not likely to
be random. Strictly speaking, the only conclusion from looking up probabilities in
statistical tables is that they justify rejecting the null hypothesis. But you will find that, if
the null hypothesis can be rejected, psychological researchers usually claim that the
results provide support for the predictions in the research hypothesis.
There is always a probabilistic component involved in the accept–reject decision in
testing hypotheses. The criterion that is used for accepting or rejecting a null hypothesis is
called the significance level or p-value. The p-value represents the probability of concluding
(incorrectly) that there is a difference in your samples when no true difference exists.
It is a statistic calculated by comparing the distribution of a given sample data and an
expected distribution (normal, F, t etc.) and is dependent upon the statistical test being
performed.
For example, if two samples are being compared in a t-test, a p-value of 0.05 means that
there is only a 5% chance of arriving at the calculated t-value if the samples were not
different (from the same population).
In other words, a p-value of 0.05 means there is only a 5% chance that you would be
wrong in concluding that the populations are different or 95% confident of making a right
decision. For social sciences research, a p-value of 0.05 is generally taken as standard.
In psychology (possibly because it is thought that nothing too terrible can happen as a
result of accepting a result as significant!) there is a convention to accept probabilities of
either 1 per cent or 5 per cent as grounds for rejecting the null hypothesis.
The way levels of significance are expressed is to state that the probability of a result
being due to random variability is less than 1 per cent or less than 5 per cent. That is why
in articles in psychological journals you will see statements that differences between
experimental conditions are ‘significant (p < 0.01)’ or ‘significant (p < 0.05)’. This means
that the probability (p) of a result occurring by chance is less than (expressed as <) 1
per cent (0.01) or 5 per cent (0.05).
Sometimes you will find other probabilities quoted, such as p < 0.02 or p < 0.001. These
represent probabilities of obtaining a random result 2 times in 100 and 1 time in 1000 (2
per cent and 0.1 per cent). These percentage probabilities give you grounds for rejecting
the null hypothesis that your results are due to the effects of random variability.
6.7 T-test
A t-test examines differences in the mean scores of a parametric dependent variable across two
groups or conditions (the independent variable). As we saw in Chapter 5, data are parametric if
they are represented by interval values and are reasonably normally distributed. The t-test
outcome is based on differences in mean scores between groups and conditions
The single- or one-sample t test is used to compare the observed mean of one sample with a
population mean. One-sample t tests are usually employed by researchers who want to determine
if some set of scores or observations deviates from some established pattern or standard.
Some situations where one sample t-test can be used are given below:
An economist wants to know if the per capita income of a particular region is the same as
the national average.
The Quality Control department wants to know if the mean dimensions of a particular
product have shifted significantly away from the original specifications.
Does the academic achievement of ECCE department students deviate significantly from
the average academic achievement of Woldia University students?
Students 1 2 3 4 5 6 7 8 9
Test 8 7 5 6 8 7 8 6 6
Population mean ( μ) = 5
H0: x̄ = μ (the sample mean is equal to the population mean; there is no difference between the sample
mean and the population mean)
H1: x̄ ≠ μ (the sample mean is different from the population mean)
t = (x̄ − μ) / (s / √n) = (6.77 − 5) / (1.092 / √9) = 1.77 / 0.364 = 4.86
The calculated t-value is 4.86 > the critical value 2.30 at 0.05 significance level. Then,
H0 is rejected.
*P < 0.05
This shows that there is a significant difference between the sample mean and the population
mean scores t (8) = 4.86, p < 0.05. This also implies that the sample mean score of stat test (M =
6.77) is significantly higher than the population mean score (M = 5) for students.
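As a check on the arithmetic, the one-sample t test above can be reproduced in plain Python. Note that exact (unrounded) arithmetic gives t ≈ 4.88; the 4.86 in the worked example reflects rounding of the intermediate values, and the conclusion is the same either way.

```python
import math

# One-sample t test of the nine stat-test scores against the
# population mean mu = 5, following the steps in the example above.
scores = [8, 7, 5, 6, 8, 7, 8, 6, 6]
mu = 5

n = len(scores)
mean = sum(scores) / n                                         # 6.78
s = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))  # sample SD
t = (mean - mu) / (s / math.sqrt(n))                           # 4.88 unrounded
print(round(mean, 2), round(t, 2))   # 6.78 4.88
```

Since 4.88 exceeds the critical value 2.30 for df = 8 at the 0.05 level, H0 is rejected, as in the text.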
Basic concepts
An independent t-test measures differences between two distinct groups. Those differences might
be directly manipulated (e.g. drug treatment group vs. placebo group), they may be naturally
occurring (e.g. male vs. female), or they might be beyond the control of the experimenter (e.g.
depressed people vs. healthy people). In an independent t-test, mean dependent variable scores are
compared between the two groups (the independent variable). For example, we could measure
differences in the amount of money spent on clothes between men and women.
The t test (unrelated) is based on comparing the means for the two groups doing each condition.
This is because there is no basis for comparing differences between related pairs of scores for
each participant. Because the t test (unrelated) is based on unrelated scores for two conditions,
which are independent of each other, another name for the t test (unrelated) is the independent t
test.
In many real-life situations, we cannot determine the exact value of the population mean. We are
only interested in comparing two populations using a random sample from each. Such
experiments, where we are interested in detecting differences between the means of two
independent groups, are called independent samples tests. Some situations where the independent
samples t-test can be used are given below:
An economist wants to compare the per capita income of two different regions.
A labor union wants to compare the productivity levels of workers for two different
groups.
An aspiring MBA student wants to compare the salaries offered to the graduates of two
business schools.
In all the above examples, the purpose is to compare between two independent groups in contrast
to determining if the mean of the group exceeds a specific value as in the case of one sample t-
tests.
Assumptions
Computing Independent Sample T- test
For example:
Male students    1   2   3   4   5   6   7   8   9  10
Scores (X1)      4   6   5   7   8   4   3   2   4   5   ΣX1 = 48
X1²             16  36  25  49  64  16   9   4  16  25   ΣX1² = 260
Female students 11  12  13  14  15  16  17  18  19  20
Scores (X2)      8   9   6   7   8  10   8   9   7  10   ΣX2 = 82
Step 1 State the null and alternative hypotheses
H0: mean1 = mean2 (the two sample means are the same)
H1: mean1 ≠ mean2 (the two means are different from each other)
Step 2 Specify the level of significance = 0.05
Step 7 Make a decision to reject or fail to reject H0
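The computational steps between Step 2 and Step 7 are not shown above; a plain-Python sketch using the standard pooled-variance formula fills the gap (the two-tailed critical value 2.10 for df = 18 at α = 0.05 is assumed, not taken from the text).

```python
import math

# Independent samples t test on the male and female scores in the table.
male   = [4, 6, 5, 7, 8, 4, 3, 2, 4, 5]
female = [8, 9, 6, 7, 8, 10, 8, 9, 7, 10]

def mean(xs):
    return sum(xs) / len(xs)

def sum_sq_dev(xs):
    """Sum of squared deviations about the sample mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

n1, n2 = len(male), len(female)
pooled_var = (sum_sq_dev(male) + sum_sq_dev(female)) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (mean(male) - mean(female)) / se    # negative: females score higher
print(round(t, 2))                      # -4.8; |t| > 2.10, so reject H0
```

With means of 4.8 (male) and 8.2 (female), |t| ≈ 4.80 exceeds the critical value, so the null hypothesis of equal means would be rejected at the 0.05 level.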
In the case of the independent samples test for the difference between means, we assume that the
observations in one sample are not dependent on those in the other. However, this assumption limits the
scope of analysis, as in many cases the study has to be done on the same set of elements (people,
objects, etc.) to control some of the sample-specific extraneous factors. Experiments where
the observations are made on the same sample at two different times are called dependent or
paired sample t-tests. Some situations where the dependent samples t-test can be used are given
below:
The HR manager wants to know if a particular training program had any impact
in increasing the motivation level of the employees.
The production manager wants to know if a new method of handling machines
helps in reducing the break down period.
An educationist wants to know if interactive teaching helps students learn more
as compared to one-way lecturing.
One can compare these cases with the previous ones to observe the difference. The subjects in all
these cases are the same, and observations are taken at two different times.
For example:
Student   1  2  3  4  5  6  7  8  9  10
Maths     4  3  3  3  4  5  4  3  5  4    Σ = 38
Civic     1  2  2  3  3  2  2  4  1  1    Σ = 21
d         3  1  1  0  1  3  2 -1  4  3    Σd = 17
d²        9  1  1  0  1  9  4  1 16  9    Σd² = 51
Step 1 State the null and alternative hypotheses
H0: mean1 = mean2 (no difference between the two means)
H1: mean1 ≠ mean2 (mean1 is different from mean2)
Step 2 Specify the level of significance = 0.05,
Step 4 Determine the critical value = from the table = 2.26 (0.05)
Step 5 Determine the rejection region – All the values > 2.26
Step 6 Find the test statistic
t = Σd / √[(nΣd² − (Σd)²) / (n − 1)] = 17 / √[(10 × 51 − 17²) / (10 − 1)] = 17 / √(221/9) = 17 / √24.56 = 17 / 4.96 = 3.43
The calculated t-value is 3.43 > the critical value 2.26 at 0.05 significance level. Then,
H0 is rejected.
*P < 0.05
This shows that there is a significant difference between the mathematics and civics mean scores, t
(9) = 3.43, p < 0.05. This also implies that the mean of mathematics (M = 3.8) is significantly
higher than the mean score of civic education (M = 2.1) for students.
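The direct-difference computation above can be verified in a few lines of Python:

```python
import math

# Paired (dependent) t test on the maths and civic scores, using the
# direct-difference formula from the worked example above.
maths = [4, 3, 3, 3, 4, 5, 4, 3, 5, 4]
civic = [1, 2, 2, 3, 3, 2, 2, 4, 1, 1]

d = [m - c for m, c in zip(maths, civic)]   # per-student differences
n = len(d)
sum_d = sum(d)                              # 17
sum_d2 = sum(x * x for x in d)              # 51
t = sum_d / math.sqrt((n * sum_d2 - sum_d ** 2) / (n - 1))
print(round(t, 2))                          # 3.43; > 2.26, so reject H0
```

This is algebraically the same as t = d̄ / (s_d / √n), so either form gives t = 3.43 with df = 9.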
The analysis of variance (ANOVA) currently enjoys the status of being probably the most used
statistical technique in psychological research, integrating with other analyses such as
regression, multivariate analysis of variance and analysis of covariance. Analysis of variance is closely
related to the t-test for comparing means in psychological research. The
popularity and usefulness of this technique can be attributed to two facts. First, the analysis of
variance, like t, deals with differences between sample means, but unlike t, it has no restriction
on the number of means. Instead of asking merely whether two means differ, we can ask whether
two, three, four, five, or k means differ. Second, the analysis of variance allows us to deal with
two or more independent variables simultaneously, asking not only about the individual effects
of each variable separately but also about the interacting effects of two or more variables
(Pagano, 2009).
Based on the number of independent variables included in the research, there are different
forms of the analysis of variance, such as one-way analysis, two-way, three-way and so on. On
the other hand, considering the design, the nature of the dependent variable and the hypothesis to
be tested, scholars categorize analysis of variance into between-participants designs,
repeated measures designs and mixed designs. A one-way analysis of variance uses
one independent variable having three or more levels with one dependent variable (Hiwett &
Crammer, 2011).
As a parametric test, analysis of variance is interested in testing the null hypothesis having one
continuously measured dependent variable with one or more categorical independent variables.
The independent variables are expected to have different levels that have organized scores
obtained from data gathering tools. Stating the null and alternative hypotheses in symbols and in
words and thereby calculating the F-ratio in accordance with the steps are important activities in
analysis of variance. If the F-ratio showed significant differences across the means, post hoc test
analysis can be done in order to know which mean is significantly different from the others. At
the same time, calculating the effect size of the independent variable on the dependent variable using
different statistical techniques such as omega and eta squared is still important (Dancey & Reidey,
2011).
Analysis of variance (ANOVA) is a method of testing the equality of three or more population
means by analyzing sample variances. The logic for preferring one-way analysis of
variance to the t-test is as follows:
Like the t test, analysis of variance deals with differences between sample means, but
unlike the t test, it has no restriction on the number of means. Instead, we can ask whether
two, three, four, five, or k means differ.
Analysis of variance allows us to deal with two or more independent variables
simultaneously, asking not only about the individual effects of each variable separately
but also about the interacting effects of two or more variables (Pagano, 2009).
According to Howell (2011), the assumptions that underlie the analysis of variance (ANOVA) using the F
statistic are organized below.
For reasons dealing with our final test of significance, we will make the assumption that scores
in each population should be normally distributed around the population mean. We made the
same assumption for t- test. Moreover, even substantial departures from normality may, under
certain conditions, have remarkably little influence on the final result.
In other words, the analysis of variance is robust with respect to violations of the
assumptions of normality and homogeneity of variance.
E. The different samples are from populations that are categorized in only one way
The samples are expected to come from one independent variable organized as levels; in other
words, the design involves only a single independent variable.
Analysis of variance (ANOVA), as the name suggests, analyses the different sources from which
variation in the scores arises.
Between-groups variance
ANOVA looks for differences between the means of the groups. When the means are very
different, we say that there is a greater degree of variation between the conditions. If there were
no differences between the means of the groups, then there would be no variation. This sort of
variation is called between-groups variation (Dancey&Reidey, 2011).
Treatment effects: When we perform an experiment, or study, we are looking to see that the
differences between means are big enough to be important to us, and that the differences
reflect our experimental manipulation. The differences that reflect the experimental
manipulation are called the treatment effects
Individual differences: Each participant is different, therefore participants will respond
differently, even when faced with the same task. Although we might allot participants
randomly to different conditions, sometimes we might find, say, that there are more
motivated participants in one condition, or they are more practiced at that particular task.
Experimental error: Most experiments are not perfect. Sometimes experimenters fail to
give all participants the same instructions; sometimes the conditions under which the tasks
are performed are different, for each condition. At other times, equipment used in the
experiment might fail, etc. Differences due to errors such as these contribute to the
variability.
Within-groups variance
Another source of variance is the differences or variation within a group. This can be thought of
as variation within the columns.
Within-groups variation arises from:
Individual differences: In each condition, even though participants have been given the
same task, they will still differ in scores. This is because participants differ among
themselves in abilities, knowledge, IQ, personality and so on. Each group, or condition, is
bound to show variability.
Experimental error: This has been explained above
Steps for test statistic in One-Way ANOVA
Step 1 State the null and alternative hypotheses
H0: μ1 = μ2 = μ3 (All population means are equal.)
Ha: At least one mean is different from the others.
Step 2 Specify the level of significance = 0.05, 0.01, 0.1
Example 1: A researcher wanted to test the effect of study skills support on the academic
achievement scores of students at Debre Markos University. He took 15 students who needed
study skills support and assigned them randomly into three groups: placebo, low support
and high support. The level of significance for this hypothesis test is 0.05. The data collected from
the students are presented in the following table.
n1 = 5, n2 = 5, n3 = 5
N = 15
Solution:
Step1: State the null and alternative hypotheses
H0: μ1 = μ2 = μ3 (All population means are equal)
Ha: μ1 ≠ μ2 ≠ μ3 (At least one mean is different from the others)
Step 2: Specify the level of significance
In the F distribution, the rejection region is all the values greater than 3.89. In other words, if F
calculated greater than 3.89 reject the null hypothesis because it is in the rejection region or, if F
calculated is less than 3.89 accept the null hypothesis.
Calculate between sum of squares (SSB)
SSB = (ΣX1)²/n1 + (ΣX2)²/n2 + (ΣX3)²/n3 − (ΣX)²/N
SSB = (20)²/5 + (40)²/5 + (65)²/5 − (125)²/15
SSB = (80 + 320 + 845) - 1041.667
SSB = 1245 - 1041.667 = 203.333
Calculate within sum of squares (SSW)
SSW = ΣX² − [(ΣX1)²/n1 + (ΣX2)²/n2 + (ΣX3)²/n3]
SSW = 1299 − [(20)²/5 + (40)²/5 + (65)²/5]
SSW = 1299 - (80 + 320 + 845)
SSW = 1299 - 1245 = 54
Calculate total sum of squares (SST)
SST = SSB + SSW = 203.333 + 54 = 257.333
Calculate between groups mean square (MSB)
MSB = SSB / DFB = 203.333 / 2 = 101.667
Calculate within groups mean square (MSW)
MSW = SSW / DFW = 54 / 12 = 4.5
Calculate the F-ratio
F = MSB / MSW = 101.667 / 4.5 = 22.59
Step 7 Make a decision to reject or fail to reject H0
F calculated = 22.59
Then, F calculated = 22.59 > F critical (2, 12) = 3.89, so reject the null hypothesis. The test
statistic falls in the rejection region; therefore, you should reject the null hypothesis.
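The raw scores in the example's table were not reproduced above, but the group totals can be recovered from the worked sums of squares ((ΣXi)²/5 = 80, 320 and 845 give ΣX1 = 20, ΣX2 = 40 and ΣX3 = 65). A sketch of the F computation from those summary values:

```python
# One-way ANOVA from summary statistics: group sums, n per group, and
# the overall sum of squared scores, recovered from the worked example.
group_sums = [20, 40, 65]     # placebo, low support, high support
n_per_group = 5
sum_x2 = 1299                 # sum of X^2 over all 15 scores

k = len(group_sums)
N = k * n_per_group
grand_sum = sum(group_sums)

between_part = sum(s ** 2 / n_per_group for s in group_sums)
ssb = between_part - grand_sum ** 2 / N   # between sum of squares
ssw = sum_x2 - between_part               # within sum of squares
msb = ssb / (k - 1)                       # df between = k - 1 = 2
msw = ssw / (N - k)                       # df within = N - k = 12
f = msb / msw
print(round(ssb, 3), round(ssw, 1), round(f, 2))   # 203.333 54.0 22.59
```

Since 22.59 > F critical (2, 12) = 3.89, the null hypothesis of equal means is rejected, matching the hand computation.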
3.3 Post hoc Analysis
Post hoc analysis is a set of multiple comparison techniques for making comparisons between two or
more group means subsequent to an analysis of variance. Since there is enough evidence at the
5% level of significance to conclude that the mean academic achievement scores of students
differ, which mean differs from the others can be determined through post hoc
analysis. Post hoc methods differ in their power and in how well they control Type I error. Some of
them are listed below.
Let’s use the post hoc technique of the Tukey test for the example given above.
When the Tukey test is used for post hoc analysis, we use the Q-distribution to find the critical
value. The multiple comparisons through the Tukey test then have four steps, done as follows.
Step 1: Compute Q-cal = (meani − meanj) / √(MSW/n) for each pair of means. For example, for
comparison 3, low study skills support with high study skills support (mean2 with mean3):
Q-cal = (mean3 − mean2) / √(MSW/n) = (13 − 8) / √(4.5/5) = 5 / 0.95 = 5.27*
Step 2: Find Q-critical from the Q-distribution by (r, df): Q (3, 12) = 3.77
Step 3: Compare each Q-cal with Q-critical.
For mean1 and mean2, Q-cal > Q-cri (4.22 > 3.77): reject the null hypothesis.
For mean1 and mean3, Q-cal > Q-cri (9.49 > 3.77): reject the null hypothesis.
For mean3 and mean2, Q-cal > Q-cri (5.27 > 3.77): reject the null hypothesis.
Step 4: Interpretation
There is enough evidence at the 5% level of significance to conclude that, across the study skills
support conditions, all means of students' academic achievement scores are significantly different
from each other.
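The three Tukey comparisons can be reproduced from the group means (ΣXi/5 gives 4, 8 and 13) and MSW = 4.5; exact arithmetic gives Q values of 4.22, 9.49 and 5.27 against the critical value 3.77 (the worked figures above reflect rounding).

```python
import math

# Tukey post hoc comparisons for the study skills example:
# Q = (mean_i - mean_j) / sqrt(MSW / n), compared with Q-critical = 3.77.
means = {"placebo": 4.0, "low": 8.0, "high": 13.0}   # group means (sum/5)
msw, n = 4.5, 5
se = math.sqrt(msw / n)

q_low_vs_placebo  = (means["low"] - means["placebo"]) / se    # 4.22
q_high_vs_placebo = (means["high"] - means["placebo"]) / se   # 9.49
q_high_vs_low     = (means["high"] - means["low"]) / se       # 5.27
for q in (q_low_vs_placebo, q_high_vs_placebo, q_high_vs_low):
    print(round(q, 2), q > 3.77)   # every comparison exceeds Q-critical
```

All three Q values exceed 3.77, so each pair of means differs significantly at the 0.05 level, as stated in the interpretation above.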
Introduction
Regression analysis is a statistical technique that is widely used for research. Regression analysis is used
to predict the behavior of the dependent variables, based on the set of independent variables. In regression
analysis, dependent variables can be metric or non-metric, and the independent variables can be metric,
categorical, or a combination of both. These days, researchers use regression
analysis in two manners: linear regression analysis and non-linear regression analysis. Linear
regression analysis is further divided into two types, simple linear regression analysis and multiple linear
regression analysis. In simple linear regression analysis, there is one dependent variable and one independent
variable. In multiple linear regression analysis, there is a dependent variable and many independent
variables. Non-linear regression analysis is also of two types, simple non-linear regression analysis and
multiple non-linear regression analysis. When there is a non-linear relationship between the dependent and
independent variables and there is one dependent and one independent variable, it is said to be simple non-
linear regression analysis. When there is a dependent variable and two or more independent
variables, it is said to be multiple non-linear regression.
Learning outcomes
Upon completing this topic, the students will be able to:
Describe basic concepts of regression
Appropriately use regression principles in different research fields
Apply regression models in research design
Perform regression analysis and interpret the results
Key Terms: Regression, Intercept, Slope, Curve fit, Polynomial, Best fit line
Linear regression is the most basic and commonly used predictive analysis. Regression estimates are used
to describe data and to explain the relationship between one dependent variable and one or more
independent variables.
At the center of the regression analysis is the task of fitting a single line through a scatter plot. The
simplest form with one dependent and one independent variable is defined by the formula y = a + b*x.
Sometimes the dependent variable is also called endogenous variable, prognostic variable or regressand.
The independent variables are also called exogenous variables, predictor variables or regressors.
However, linear regression analysis consists of more than just fitting a linear line through a cloud of
data points. It consists of three stages: (1) analyzing the correlation and directionality of the data, (2)
estimating the model, i.e., fitting the line, and (3) evaluating the validity and usefulness of the model.
1) Regression might be used to identify the strength of the effect that the independent variable(s) have on a
dependent variable. Typical questions are: what is the strength of the relationship between dose and effect,
sales and marketing spend, or age and income?
2) It can be used to forecast effects or impacts of changes. That is, regression analysis helps us
understand how much the dependent variable will change when we change one or more independent
variables. A typical question is: how much additional Y do I get for one additional unit of X?
3) Regression analysis predicts trends and future values. The regression analysis can be used to get point
estimates. Typical questions are: what will the price of gold be 6 months from now? What is the total
effort for task X?
Assumptions:
Simple linear regression is a measure of linear association that investigates straight-line relationships
between a continuous dependent variable and an independent variable. It is best explained through the
regression equation Y = α + βX.
β is the estimated coefficient of the strength and direction of the relationship between the
independent (IV) and dependent variable (DV).
α (Y intercept) is a fixed point that is considered a constant (how much Y can exist without X).
Standardized Regression Coefficient (β)
Estimated coefficient of the strength of the relationship between the IV and DV.
Expressed on a standardized scale where higher absolute values indicate stronger
relationships (the scale ranges from -1 to 1).
Parameter Estimate Choices
Raw regression estimates (b1)
Raw regression weights have the advantage of retaining the scale metric—which is
also their key disadvantage.
If the purpose of the regression analysis is forecasting, then raw parameter estimates
must be used. The researcher is interested only in prediction.
Standardized regression estimates (β1)
Standardized regression estimates have the advantage of a constant scale.
Standardized regression estimates should be used when the researcher is testing
explanatory hypotheses
3.3. Predictive Methods
With the exception of the mean and standard deviation, linear regression is possibly the most widely
used of statistical techniques. This is because many of the problems that we encounter in research settings
require that we quantitatively evaluate the relationship between two variables for predictive purposes.
By predictive, I mean that the values of one variable depend on the values of a second.
interested in calibrating an instrument such as a sprayer pump. We can easily measure the current or
voltage that the pump draws, but specifically want to know how much fluid it pumps at a given
operating level. Or we may want to empirically determine the production rate of a chemical product
given specified levels of reactants.
Linear regression, which is the natural extension of correlation analysis, provides a great starting
point toward these objectives.
Curve fit - This is perhaps the most general term for describing a predictive relationship between two
variables, because the "curve" that describes the two variables is of unspecified form.
Polynomial fit - A polynomial fit describes the relationship between two variables as a mathematical
series. Thus a first order polynomial fit (a linear regression) is defined as y = a + bx. A second order
(parabolic) fit is y = a + bx + cx^2, a third order (cubic) fit is y = a + bx + cx^2 + dx^3, and so on.
Best fit line - The equation that best describes the y or dependent variable as a function of the x or
independent variable.
Linear regression and least squares linear regression - This is the method of interest. The
objective of linear regression analysis is to find the line that minimizes the sum of squared deviations
of the dependent variable about the "best fit" line. Because the method is based on least squares, it is
said to be a BLUE method, a Best Linear Unbiased Estimator.
6.1.2. Defining the Regression Model
We've already stated that the general form of the generalized linear regression is: y= a + bx. The
coefficient "a" is a constant called the y-intercept of the regression. The coefficient "b" is called the
"slope" of the regression. It describes the amount of change in y that corresponds to a given change in
x.
Specifically, the slope is defined as the summed cross product of the deviations of x and y from their
respective means, divided by the sum of squares of the deviations of x from its mean:
b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
The second relationship above is useful if these quantities have to be calculated by hand. The standard error
values of the slope and intercept are mainly used to compute the 95% confidence intervals. If you
accept the assumptions of linear regression, there is a 95% chance that the 95% confidence interval of
the slope contains the true value of the slope, and that the 95% confidence interval for the intercept
contains the true value of the intercept.
It's interesting to note that the slope in the generalized case is equal to the linear correlation
coefficient scaled by the ratio of the standard deviations of y and x: b = r (sy / sx).
This explicitly defines the relationship between linear correlation analysis and linear regression.
Notice that in the case of standardized regression, where sy and sx = 1, b = r. From this definition,
it should be clear that the best fit line passes through the mean values for x and y.
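These relations can be illustrated with a small least-squares fit in Python; the data here are hypothetical illustration values, not from the text. The assertion at the end confirms that the slope equals r·(sy/sx) and that the intercept forces the line through (x̄, ȳ).

```python
import math

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # cross products
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

b = sxy / sxx           # slope
a = my - b * mx         # intercept: line passes through (x-bar, y-bar)
r = sxy / math.sqrt(sxx * syy)

# slope = r * (sy / sx); the (n - 1) factors cancel in the ratio
assert abs(b - r * math.sqrt(syy / sxx)) < 1e-12
print(round(b, 3), round(a, 2), round(r, 3))
```

For these values the fit is nearly perfect (r ≈ 0.999), so the standardized slope is almost identical to the correlation coefficient, as the text notes.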
Assumptions
There are several assumptions that must be met for the linear regression to be valid:
The scatter of the y values about the y estimates (denoted yhat) based on the best fit line is often referred
to as the "standard error of the regression": s = √[ Σ(y − ŷ)² / (n − 2) ].
Notice that two degrees of freedom are lost in the denominator: one for the slope and one for the
intercept. A more descriptive definition - and strictly correct name - for this statistic is the root mean
square error (denoted RMS or RMSE).
How much variance is explained?
Just as in linear correlation analysis, we can explicitly calculate the variance explained by the regression
model: R² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)².
As with the other statistics that we have studied, the slope and intercept are sample statistics based on data
that include some random error, e: y = a + bx + e. We are of course actually interested in the true
population parameters, which are defined without error: y = α + βx. How do we assess the significance
level of the model? In essence we want to test the null hypothesis that b = 0 against one of three possible
alternative hypotheses: b > 0, b < 0, or b ≠ 0.
There are at least two ways to determine the significance level of the linear model. Perhaps the easiest
method is to calculate r, and then determine significance based on the value of r and the degrees of
freedom using a table for significance of the linear or product moment correlation coefficient. This
method is particularly useful in the standardized regression case when b=r.
The significance level of b, can also be determined by calculating a confidence interval for the slope. Just
as we did in earlier hypothesis testing examples, we determine a critical t-value based on the correct
number of degrees of freedom and the desired level of significance. It is for this reason that the random
variables x and y must be bivariate normal.
For the linear regression model the appropriate degrees of freedom is always df = n − 2. The level of
significance of the regression model is determined by the user; the 95% or 99% levels are generally
used.
The confidence interval is then defined as the product of the critical t-value and Sb, the standard error
of the slope: CI = b ± t-crit × Sb, where Sb is defined as Sb = s / √[ Σ(x − x̄)² ], with s the standard
error of the regression defined above.
Interpretation.
If there is a significant slope, then b will be statistically different from zero. So if b is greater than (t-
crit)*Sb, the confidence interval does not include zero. We would thus reject the null hypothesis that b = 0
at the pre-determined significance level. As (t-crit)*Sb becomes smaller, the greater our certainty in beta,
and the more accurate the prediction of the model.
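A sketch of this significance test in Python, reusing the hypothetical data from the earlier regression illustration; Sb = s / √Σ(x − x̄)² and t-crit = 2.776 (the two-tailed 0.05 critical value for df = n − 2 = 4) are the assumed ingredients.

```python
import math

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx

# standard error of the regression (RMSE with n - 2 df), then SE of slope
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
sb = s / math.sqrt(sxx)

t_crit = 2.776                      # two-tailed 0.05 critical t, df = 4
lo, hi = b - t_crit * sb, b + t_crit * sb
print(round(lo, 3), round(hi, 3))   # CI excludes zero -> reject H0: b = 0
```

Because the lower bound of the interval is well above zero, the slope is significantly different from zero at the 0.05 level for these data.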
If we plot the confidence interval on the slope, then positive and negative limits of the confidence interval
of the slope plot as lines that intersect at the point defined by the mean x,y pair for the data set. In effect,
this tends to underestimate the error associated with the regression equation, because it neglects the role of
the intercept in controlling the position of the line in the cartesian plane defined by the data. Fortunately,
we can take this into account by calculating a confidence interval on the line.
Just as we did in the case for the confidence interval on the slope, we can write this out explicitly as a
confidence interval for the regression line, that is defined as follows:
The degrees of freedom is still df = n − 2, but now the standard error of the regression line is defined as:
S_line = s √[ 1/n + (x − x̄)² / Σ(x − x̄)² ]