Statistics Course
INTRODUCTION TO STATISTICS
What does the word statistics mean? To most people, it suggests numerical facts or data, such as
unemployment figures, farm prices, or the number of marriages and divorces. The most common
definitions of the word statistics are as follows:
Statistics is the science of planning studies and experiments, obtaining data, and then
organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions
based on the data (Triola, 2012).
Statistics is facts or data, either numerical or qualitative, organized and summarized so as
to provide useful and accessible information about a particular subject (Weiss, 2012).
Statistics is the science of organizing, summarizing, and analyzing numerical information
in order to make decisions (Weiss, 2012).
Statistics is the science of collecting, organizing, analyzing, and interpreting data in order
to make decisions.
A statistic is some piece of information that is presented in numerical form; the field focuses on
appropriate ways to collect, codify, analyze, and interpret numerical information (Dunn,
2001). For example,
A nation's 5% unemployment rate
A candidate receiving 61% of the popular vote
Academic achievement and status of students
Attrition and dropout rates of students in Woldia University across colleges
and departments
The budget allocated to Woldia University from 2004 to 2010 E.C
Studying statistics helps you:
to be an intelligent consumer of statistical information
to write up analyses and results in American Psychological Association (APA) style
Statistics has two branches: descriptive statistics and inferential statistics.
Inferential statistics
Consists of methods for drawing and measuring the reliability of conclusions about a
population based on information obtained from a sample of the population.
Permits generalizations to be made about populations based on sample data drawn from
them.
Uses statistics, which are measures of a sample, to infer values of parameters, which are
measures of a population.
Is the branch of statistics that involves using a sample to draw conclusions about a
population.
Inferential statistics include the t-test, correlation, ANOVA, MANOVA, regression, and factor
analysis, which use sample data and generalize the findings to the population.
Descriptive statistics
Statistical procedures that describe, organize, and summarize the main characteristics of
sample data.
Simply describe the set of data at hand.
Is the branch of statistics that involves the organization, summarization, and display of
data.
Descriptive statistics use ratios, percentages, means, tables, graphs, figures, charts, standard
deviations, diagrams, and ranges.
Practical Example 1 - Decide which part of each study represents descriptive
statistics. What conclusions might be drawn from the study using inferential
statistics?
1. A large sample of men, aged 48, was studied for 18 years. For unmarried
men, approximately 70% were alive at age 65. For married men, 90% were
alive at age 65.
2. A survey conducted among 1017 men and women by Opinion
Research Corporation International found that 76% of women and 60% of
men had a physical examination within the previous year.
Solution for question 1
Descriptive statistics involves statements such as
“For unmarried men, approximately 70% were alive at age 65”
“For married men, 90% were alive at 65.”
Solution for question 2 - An inference drawn from the study is that a higher percentage of
women than men had a physical examination within the previous year.
Data
Data Sets
There are two types of data sets you will use when studying statistics: populations and
samples.
Population
The complete collection of all individuals (scores, people, measurements, and so on) to be
studied. The collection is complete in the sense that it includes all of the individuals to be
studied
The collection of all individuals or items under consideration in a statistical study
Complete set of events in which you are interested.
is the collection of all outcomes, responses, measurements, or counts that are of interest
For instance
if we were interested in the stress levels of all adolescent Americans, then the
collection of all adolescent Americans' stress scores would form a population,
the scores of all morphine-injected mice
the milk production of all cows in the country
The ages at which every girl first began to walk
the stress scores of the sophomore class in Woldia University
The population can range from a relatively small set of numbers, which is easily collected, to an
infinitely large set of numbers, which can never be collected completely. The populations in
which we are interested are usually quite large, so collecting complete data can be difficult;
researchers therefore collect data from a representative sample taken from the
population.
Census - a study that obtains data from every member of the population.
Sample - a subset, or part, of the population from which data are actually collected.
Practical example 2: In a recent survey, 1500 adults in the United States were asked if they
thought there was solid evidence of global warming. Eight hundred fifty-five of the adults said yes.
Solution
The population consists of the responses of all adults in the United States
The sample consists of the responses of the 1500 adults in the United States in
the survey.
Parameter - a numerical measure that describes a characteristic of a population.
Statistic - a numerical measure that describes a characteristic of a sample.
N.B -It is important to note that a sample statistic can differ from sample to sample
whereas a population parameter is constant for a population.
Practical example 3: Decide whether the numerical value describes a population parameter
or a sample statistic.
1. A recent survey of 200 college career centers reported that the average starting salary
for petroleum engineering majors is $83,121.
2. The 2182 students who accepted admission offers to Northwestern University in 2009
have an average SAT score of 1442
3. In a random check of a sample of retail stores, the Food and Drug Administration found
that 34% of the stores were not storing fish at the proper temperature.
Solution
1. The average of $83,121 is based on a subset of the population, so it is a sample statistic.
2. The SAT score of 1442 is based on all the students who accepted admission offers in 2009, so
it is a population parameter.
3. The percentage of 34% is based on a subset of the population, so it is a sample statistic.
When doing a study, it is important to know the kind of data involved. The nature of the data you
are working with will determine which statistical procedures can be used. Data sets can consist of
two types of data: qualitative data and quantitative data.
Qualitative data
Practical Example 4 - As shown in the table below, which data are qualitative and which
are quantitative?
A variable is any factor that can be measured or can take on different values. Such factors can
vary from person to person, place to place, or experimental situation to experimental situation.
A variable is anything that can take on different values.
For example:
1. Discrete variable
Data are expressed in numbers whose possible values form a finite or "countable" set:
0, 1, 2, 3, 4, 5, and so on.
A quantitative variable whose possible values are counting numbers but not fractional
numbers.
A "discrete" variable is used to characterize data in terms of whole numbers (1, 2, 3, and so
on) with no fractional counts occurring between them.
For example, the number of students in a class or the number of children in a family.
2. Qualitative variable
Qualitative variables are those variables which differ in kind rather than degree. These
can be measured on nominal or ordinal scales.
For example
Gender - females and males
Political parties – liberals, democratic, republican and so on
Grade levels – grade 1, grade 2 or 1st year, 2nd year, 3rd year
Economic status - destitute, poor, rich, wealthy
Academic status – warning, probation, promoted
Colleges – Education, FBE, Technology, Agriculture ……
1.4 Scales / levels/ of measurement
Measurement represents a set of rules informing us how values are assigned to objects or
events. Stevens (1946) identified four scales in his theory: nominal, ordinal, interval, and ratio,
in that order. Each scale includes an extra feature or rule beyond those in the one before it.
We will add a fifth scale to Stevens's treatment, summative response scaling, placing it between
the ordinal and the interval scale.
1. Nominal Scales
An observation is simply given a name, a label, or is otherwise classified.
Nominal scales use numbers, but these numbers are not in any mathematical relationship
with one another.
A nominal scale uses numbers to identify qualitative differences among measurements.
The measurements made by a nominal scale are names, labels, or categories, and no
quantitative distinctions can be drawn among them.
More qualitative and provide less information.
Nominal scales are the lowest level of measurement.
Categorical variables represent different categories.
Shows membership in a category.
The data are organized in the form of frequency counts for a given category.
Frequency counts simply tell us how many people we have in each category.
For example - gender (1 = male, 2 = female), ethnicity or religion of a person, smoker vs.
nonsmoker, literate vs. illiterate.
2. Ordinal scales
The measurement of an observation involves ranking or ordering based on an underlying
dimension.
An ordinal scale ranks or orders observations based on whether they are greater than or
less than one another.
Ordinal scales do not provide information about how close or distant observations are
from one another.
An ordinal scale of measurement uses numbers to convey "less than" and "more than"
information. This most commonly translates as rank ordering. Objects may be ranked in
the order that they align themselves on some quantitative dimension, but it is not possible
from the ranking information to determine how far apart they are on the underlying
dimension.
3. Interval scales
Interval scales of measurement have all of the properties of nominal, ordinal, and
summative response scales.
The most common illustrations of an equal-interval scale are the Fahrenheit and Celsius
temperature scales.
According to Stevens, "Equal intervals of temperature are sealed off by noting equal
volumes of expansion." Essentially, the difference in temperature between 30°F and 40°F
is equal to the difference between 70°F and 80°F.
A less obvious but important characteristic of interval scales is that they have
arbitrary zero points.
For example, zero degrees does not mean the absence of temperature - on the
Celsius scale, zero degrees is the temperature at which water freezes.
As was true for summative response scales, it is meaningful to average data collected on
an interval scale of measurement, e.g., "The average high temperature in our home town
last week was 51.4°F."
4. Ratio scales
Common examples of ratio scales are time and measures of distance.
We can interpret ratios of the numbers on these scales in a meaningful way:
four hours is twice as long as two hours, and three miles is half the distance of six
miles.
CHAPTER TWO
ORGANIZING AND PRESENTING DATA
2.1 Raw Data
Raw data are primary or secondary data (e.g., numbers, instrument readings, figures)
collected from a source. Raw data have not yet been organized, summarized, or otherwise
processed.
Step 1 - List the distinct values of the observations in the data set in the first column of a table.
Step 2 - For each observation, place a tally mark in the second column of the table in the row of
the appropriate distinct value.
Step 3 - Count the tallies for each distinct value and record the totals in the third column of the
table.
What is the highest level of education you have completed (please tick)? The responses of the 40
participants in the study are given in the table below. Determine a frequency distribution of these
data.
❐ 1. Illiterate          ❐ 4. Technique/College
❐ 2. Primary school      ❐ 5. Undergraduate university
❐ 3. Secondary school    ❐ 6. Postgraduate
Solution
Step 1 - List the distinct values of the observations in the data set in the first column of a table.
Step 2 - For each observation, place a tally mark in the second column of the table in the row of
the appropriate distinct value.
Step 3 - Count the tallies for each distinct value and record the totals in the third column of the
table. Counting the tallies in the second column gives the frequencies in the third column.
The first and third columns then provide a frequency distribution for the data.

Category            Tally           Frequency
Illiterate          ////            4
Primary             //// //// //    12
Secondary           //// ///        8
Technique/College   //// //         7
Undergraduate       //// /          6
Postgraduate        ///             3
Total                               40
In addition to the frequency with which a particular distinct value occurs, we are often interested
in the relative frequency, which is the ratio of the frequency to the total number of observations.
Step 1 - Obtain a frequency distribution of the data. We obtained a frequency distribution of the
data in the table above.
Relative frequency = Frequency / Number of total observations

Relative frequency of a category = Frequency of category / Number of total observations

Relative F for illiterate = Frequency of illiterate / Number of total observations = 4/40 = 0.1

Relative F for primary = Frequency of primary / Number of total observations = 12/40 = 0.3

Relative F for secondary = Frequency of secondary / Number of total observations = 8/40 = 0.2
What is the highest level of education you have completed?

Category            Frequency   Relative frequency (%)
Illiterate          4           10
Primary             12          30
Secondary           8           20
Technique/College   7           17.5
Undergraduate       6           15
Postgraduate        3           7.5
Total               40          100
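The tallying and relative-frequency steps above can be sketched in Python. The raw responses below are reconstructed so the counts match the table; they are not the original survey records.

```python
from collections import Counter

# Hypothetical raw responses, reconstructed so the counts match the table above
responses = (["Illiterate"] * 4 + ["Primary"] * 12 + ["Secondary"] * 8 +
             ["Technique/College"] * 7 + ["Undergraduate"] * 6 +
             ["Postgraduate"] * 3)

freq = Counter(responses)          # Steps 1-3: tally each distinct value
n = len(responses)                 # total number of observations (40)
rel_freq = {cat: f / n for cat, f in freq.items()}

print(freq["Primary"], rel_freq["Primary"])   # 12 0.3
```

Multiplying each relative frequency by 100 gives the percentage column of the table.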
A bar graph is a graph that displays the frequency or numerical distribution of a categorical
variable, showing values for each bar next to each other for easy comparison. A bar chart is a
graphical display of data that have been classified into a number of categories. Equal-width
rectangular bars are used to represent each category, with the heights of the bars being
proportional to the observed frequency in the corresponding category.
2. Bars can be vertical or horizontal.
4. The y-axis represents the quantitative values of the variable being displayed.
7. Heights of bars represent the values of the variable displayed - the frequency of occurrence or
percentage of occurrence.
8. The graph is well annotated with a title, labels for each bar, a vertical scale, horizontal
categories, and the source.
[Bar graph: "What is the highest level of education you have completed?" - Illiterate 10%,
Primary 30%, Secondary 20%, Technique/College 18%, Undergraduate 15%, Postgraduate 8%]
Pie Chart
A pie chart is a disk divided into wedge-shaped pieces proportional to the relative
frequencies of the qualitative data
Another method for organizing and summarizing data is to draw a picture of some kind. The old
saying “a picture is worth a thousand words” has particular relevance in statistics—a graph or
chart of a data set often provides the simplest and most efficient display. Two common methods
for graphically displaying qualitative data are pie charts and bar charts. We begin with pie charts.
[Pie chart: "What is the highest level of education you have completed?" - Illiterate 10%,
Primary 30%, Secondary 20%, Technique/College 18%, Undergraduate 15%, Postgraduate 8%]
Bar Graphs are easier to make & to read than pie charts
Both pie charts & bar graphs can display the distribution of a categorical variable
A bar graph can also compare any set of quantities measured in the same units
Organizing Quantitative Data using frequency distribution
To organize quantitative data, we first group the observations into classes. Consequently, once
we group the quantitative data into classes, we can construct frequency and relative-frequency
distributions of the data in exactly the same way as we did for qualitative data. Several methods
can be used to group quantitative data into classes. Here we discuss two of the most common
methods: single-value grouping and limit grouping
Single-Value Grouping
In some cases, the most appropriate way to group quantitative data is to use classes in which
each class represents a single possible value. Such classes are called single-value classes, and this
method of grouping quantitative data is called single-value grouping.
Table 7: Test scores taken from first year students in statistics class
A second way to group quantitative data is to use class limits. With this method, each class
consists of a range of values. The smallest value that could go in a class is called the lower limit
of the class, and the largest value that could go in the class is called the upper limit of the class.
This method of grouping quantitative data is called limit grouping. It is particularly useful when
the data are expressed as whole numbers and there are too many distinct values to employ
single-value grouping.
Class width: The difference between the lower limit of a class and the lower limit of the
next-higher class.
Midpoint: The average of the two class limits of a class.
Table 10: Grouped data frequency distribution
Class width (i) = Range / Number of intervals = 36/5 = 7.2 (round up to 8)
To set the lower and upper boundaries, 0.5 is subtracted from the lower limit and added to the
upper limit of each class interval. The class boundaries of the distribution are therefore
organized as follows:
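The width and boundary rules above can be sketched in Python. The class limits listed are illustrative values, not taken from a specific table here.

```python
import math

# Class width: range divided by the number of intervals, rounded up
data_range = 36
num_intervals = 5
width = math.ceil(data_range / num_intervals)   # 36/5 = 7.2 -> 8
print(width)   # 8

# Class boundaries: subtract 0.5 from each lower limit, add 0.5 to each upper limit
limits = [(18, 25), (26, 33), (34, 41)]         # illustrative class limits
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]
print(boundaries[0])   # (17.5, 25.5)
```

The 0.5 adjustment assumes whole-number data, so adjacent class boundaries meet without gaps.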
UNIT THREE
MEASURES OF CENTRAL TENDENCY
Central tendency is a statistical measure that determines a single value that accurately
describes the center of the distribution and represents the entire distribution of scores.
The goal of central tendency is to identify the single value that is the best representative
for the entire set of data.
A measure of central tendency is a single value representing a group of values and hence is
supposed to have the following properties.
A good measure of central tendency must be easy to comprehend, and the procedure involved in
its calculation should be simple.
3. Rigidly defined
A measure of central tendency must be clearly and properly defined. It is better if it is
algebraically defined so that personal bias can be avoided in its calculation.
A good average should not be unduly affected by extreme or extraordinary values in a series.
It is capable of further algebraic treatment
Mean is the center in balancing the values on either side of it and hence is more typical
The mean is sensitive to the exact value of all the scores in the distribution
The sum of the deviations about the mean equals zero
3.1.2 Computing Means of Ungrouped Data

x̄ = (sum of all x) / (number of x) = Σx / n
Example: The following data represent the ages of 20 students in a statistics class. Calculate the
mean age of the students.
20 20 20 20 20 20 21
21 21 21 22 22 22 23
23 23 23 24 24 65
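The mean of these ungrouped ages can be computed directly. Note how the single extreme value (65) pulls the mean above almost every score, illustrating the mean's sensitivity to extreme values mentioned above.

```python
# Mean of ungrouped data: x-bar = (sum of all x) / (number of x)
ages = [20, 20, 20, 20, 20, 20, 21,
        21, 21, 21, 22, 22, 22, 23,
        23, 23, 23, 24, 24, 65]

mean_age = sum(ages) / len(ages)   # 475 / 20
print(mean_age)                    # 23.75
```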
Step 1 - Prepare the class intervals or boundaries.
Step 2 - Find the midpoint of each class.
Step 3 - Find the sum of the products of the midpoints and the frequencies.
Class interval   Frequency   Cumulative frequency   Midpoint (x)   x·f
18 – 25          13          13                     21.5           279.5
26 – 33          8           21                     29.5           236
34 – 41          4           25                     37.5           150
42 – 49          3           28                     45.5           136.5
50 – 57          2           30                     53.5           107
                 N = 30                                            Σ(x·f) = 909
Then, Mean = Σ(f·x) / N = 909/30 = 30.3
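The grouped-mean calculation above can be sketched as:

```python
# Grouped mean: mean = sum(f * x) / N, where x is each class midpoint
intervals = [(18, 25), (26, 33), (34, 41), (42, 49), (50, 57)]
freqs = [13, 8, 4, 3, 2]

midpoints = [(lo + hi) / 2 for lo, hi in intervals]   # 21.5, 29.5, ...
N = sum(freqs)                                        # 30
mean = sum(f * x for f, x in zip(freqs, midpoints)) / N
print(mean)   # 30.3
```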
     Class interval   Frequency (f)   Cumulative frequency   f·x    Midpoint (x)
1    9.5 – 14.5       1               1                      12     12
2    14.5 – 19.5      1               2                      17     17
3    19.5 – 24.5      2               4                      44     22
4    24.5 – 29.5      7               11                     189    27
5    29.5 – 34.5      3               14                     96     32
6    34.5 – 39.5      2               16                     74     37
7    39.5 – 44.5      4               20                     168    42

Mean = Σ(f·x) / N = 600/20 = 30
The median is the point in a data set above and below which half of the cases fall.
The median of a data set is the measure of center that is the middle value when the
original data values are arranged in order of increasing (or decreasing) magnitude.
The median is the middle score of a data set when the scores are ordered from the
smallest to the largest.
The median is a number or score that precisely divides a distribution of data in half:
fifty percent of a distribution's observations fall above the median and fifty percent
fall below it.
The middle number in an ordered set of numbers; it divides the data into two equal
parts.
The median can be used for calculations involving ordinal-, interval-, or ratio-scale
data.
It can be difficult to compute because the data must be sorted.
It is the best average for ordinal data.
It is unaffected by extreme data.
If a data set is odd in number, the median falls exactly on the middle number.
If a data set is even in number: the median is the average of the two middle values.
Example: Find the median of the scores 26 32 21 12 15 11 27 16 18 21 19 28 10 13 31
Step 1: To calculate the median, arrange the scores from the lowest to the highest:
10 11 12 13 15 16 18 19 21 21 26 27 28 31 32
Step 2: The location of the median can be found by taking the middle value or using a
simple formula: position of median = (N + 1)/2 = (15 + 1)/2 = 8
The median is therefore the 8th score, which is 19.
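The two steps above translate directly into code:

```python
# Median of ungrouped data: sort, then take the (N + 1)/2-th score
scores = [26, 32, 21, 12, 15, 11, 27, 16, 18, 21, 19, 28, 10, 13, 31]

ordered = sorted(scores)             # Step 1: arrange from lowest to highest
position = (len(ordered) + 1) // 2   # Step 2: (15 + 1)/2 = 8th score
median = ordered[position - 1]       # list indexing starts at 0
print(median)   # 19
```

For an even number of scores, the median would instead be the average of the two middle values, as noted above.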
Based on the following frequency distribution, answer the questions given below the data.
Questions
There are steps for the calculation of the median in a frequency distribution:
Step 2: Find n/2 to identify the median class.
Step 3: Find in the cumulative frequency column the first value greater than n/2; the
corresponding class interval is called the median class.
Step 4: Calculate the median of the distribution.

Median class: n/2 = 20/2 = 10

Median = L + ((n/2 − m) / f) × c
Where: n = the total number of scores
L = the lower boundary of the median class
m = the cumulative frequency before the median class
f = the frequency of the median class
c = the class width
The median position lies between the cumulative frequencies 4 and 11. Corresponding to 4 the
"less than" boundary is 24.5, and corresponding to 11 it is 29.5. Therefore the median class is
24.5–29.5, and its lower boundary is 24.5.
Here L = 24.5, n = 20, f = 7, c = 5, m = 4
Median = 24.5 + ((10 − 4)/7) × 5 = 24.5 + (6/7) × 5 = 24.5 + 4.29 = 28.79
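The grouped-median formula can be checked in a few lines:

```python
# Median for grouped data: Median = L + ((n/2 - m) / f) * c
L = 24.5   # lower boundary of the median class (24.5 - 29.5)
n = 20     # total number of scores
m = 4      # cumulative frequency before the median class
f = 7      # frequency of the median class
c = 5      # class width

median = L + ((n / 2 - m) / f) * c
print(round(median, 2))   # 28.79
```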
3.1.1. Properties of mode
Based on the following frequency distribution, answer the questions given below the data.
Questions
The modal class can be identified more easily than the median class: it is the class with the
highest frequency in the distribution. Here, the modal class is 24.5 to 29.5.
Mode = 24.5 + (3/(3 + 2)) × 5 = 24.5 + (3/5) × 5 = 24.5 + 3 = 27.5
Class work
10 M 2 M 8 M 7 F
2 F 8 F 9 F 6 M
Based on the table above, answer the following questions.
UNIT FOUR
MEASURES OF VARIABILITY
Measures of variability provide information about the amount of spread or dispersion among the
variables. Range, variance, and standard deviation are the common measures of variability.
Range
Simply the difference between the largest and smallest values in a set of data.
Considered primitive, as it uses only the extreme values, which may not be useful
indicators of the bulk of the population.
Can be used for ordinal data.
The formula is: Range = largest observation − smallest observation
Standard deviation
Measures the 'average deviation' of observations from the mean.
Used on ratio or interval data.
The standard deviation measures the variation among data values.
Values close together have a small standard deviation, but values with much more
variation have a larger standard deviation.
For many data sets, a value is unusual if it differs from the mean by more than two
standard deviations.
For example - the following are assessment scores of students in Abnormal Psychology.
Calculate the variance and standard deviation of the data set.
Sum of squares: Σ(x − x̄)² = 88.5

Sample variance: s² = Σ(x − x̄)² / (n − 1) = 88.5 / (10 − 1) = 9.83

SD = √9.83 = 3.135
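The variance and standard deviation calculation can be sketched from the quantities given above (the raw assessment scores themselves are not reproduced here, only their sum of squared deviations):

```python
import math

# Sample variance from the worked example: s^2 = SS / (n - 1),
# where SS = sum of squared deviations from the mean = 88.5 and n = 10
sum_of_squares = 88.5
n = 10

variance = sum_of_squares / (n - 1)       # 88.5 / 9
sd = math.sqrt(variance)
print(round(variance, 2), round(sd, 3))   # 9.83 3.136
```

(The text reports √9.83 as 3.135, truncating rather than rounding the final digit.)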
Variance
Is the sum of the squared deviations of each value from the mean, divided by the number
of observations.
The mean of squared differences between scores and the mean.
Used on ratio or interval data.
Used for advanced statistical analysis.
Is equal to the average of the squared deviations from the mean of a distribution.
Symbolically, sample variance is s² and population variance is σ².
Classwork - Test
Measures of position tell where a specific data value falls within the data set or its relative
position in comparison with other data values.
Interquartile Range
IQR = Q3 − Q1
Find a quartile by determining the value in the appropriate position in the ranked data:
Q1 is at position (n + 1)/4, Q2 at position (n + 1)/2, and Q3 at position 3(n + 1)/4.
For example: Sample ordered data (n = 9): 11 12 13 16 16 17 18 21 22
Q1 is in the (9 + 1)/4 = 2.5 position of the ranked data, so Q1 = (12 + 13)/2 = 12.5
Q2 is in the (9 + 1)/2 = 5th position of the ranked data, so Q2 = median = 16
Q3 is in the 3(9 + 1)/4 = 7.5 position of the ranked data, so Q3 = (18 + 21)/2 = 19.5
Then, interquartile range = Q3 − Q1 = 19.5 − 12.5 = 7
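The position method used in this example can be sketched as a small helper function:

```python
# Quartiles by the position method above: Qk lies at position k(n + 1)/4
# in the ordered data; a halfway position averages the two neighbouring values.
def quartile(ordered, k):
    pos = k * (len(ordered) + 1) / 4           # 1-based position in the ranked data
    i = int(pos)                               # whole part of the position
    if pos == i:                               # whole-number position: take that value
        return ordered[i - 1]
    return (ordered[i - 1] + ordered[i]) / 2   # average the neighbours

data = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])   # n = 9
q1, q2, q3 = quartile(data, 1), quartile(data, 2), quartile(data, 3)
print(q1, q2, q3, q3 - q1)   # 12.5 16 19.5 7.0
```

Note that statistical software often uses different interpolation conventions for fractional positions, so results can differ slightly from this textbook method.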
For example, the 50th percentile, denoted P50, has about 50% of the data values below it and
about 50% of the data values above it, so the 50th percentile is the same as the median. There is
no universal agreement on a single procedure for calculating percentiles, but we will describe
two relatively simple procedures:
(1) Finding the percentile of a data value:

Percentile of x = (number of data values less than x / n) × 100
(2) Finding the data value corresponding to a given percentile:
L = the locator that gives the position of the value: L = (k/100) × N
For example, for the 25th percentile, L = 25/100 × N; here L = 3.25, which is rounded up to the
4th position, and the 4th value is 6. This shows that 25% of students scored 6 and below.
Find the value of the 50th percentile.
4.3.5. Z-score
Z-scores are merely scores expressed in terms of the number of standard statistical
units of measurement (standard deviations) they are from the mean of the set of scores.
A z-score (or standardized value) is found by converting a value to a standardized
scale, as given in the following definition. This definition shows that a z-score is the
number of standard deviations that a data value is deviated from the mean.
A z-score (or standardized value) is the number of standard deviations that a given
value x is above or below the mean.
We used the range rule of thumb to conclude that a value is "unusual" if it is more
than 2 standard deviations away from the mean. It follows that unusual values have
z-scores less than −2 or greater than +2.
A positive z-score means that a score is above the mean.
A negative z-score means that a score is below the mean.
A z-score of 0 means that a score is exactly the same as the mean.
For example
A student scored a 65 on a math test that had a mean of 50 and a standard deviation of 10. She
scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative
position on the two tests.
Solution
Math: z = (65 − 50)/10 = 15/10 = 1.5
History: z = (30 − 25)/5 = 5/5 = 1.0
The student did better in math because her z-score there was higher.
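The comparison can be sketched as:

```python
# z = (x - mean) / sd converts raw scores to a common standardized scale,
# making scores from different tests directly comparable
def z_score(x, mean, sd):
    return (x - mean) / sd

math_z = z_score(65, 50, 10)     # (65 - 50) / 10
history_z = z_score(30, 25, 5)   # (30 - 25) / 5
print(math_z, history_z)         # 1.5 1.0
```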
Example 2
Find the z-score for each test and state which test is better
Test A:
Test B:
CHAPTER FIVE
MEASURES OF RELATIONSHIP
The correlation coefficient, r, measures the strength of the linear relationship between two paired
variables in a sample. A Pearson or Spearman correlation is used when you want to explore the
strength of the relationship between two continuous variables. This gives you an indication of
both the direction (positive or negative) and the strength of the relationship. A positive
correlation indicates that as one variable increases, so does the other; a negative correlation
indicates that as one variable increases, the other decreases.
Different authors suggest different interpretations; however, Cohen (1988, pp. 79–81) suggests
the following guidelines:

Category   Positive             Negative
Small      r = 0.10 to 0.29     r = −0.10 to −0.29
Medium     r = 0.30 to 0.49     r = −0.30 to −0.49
Large      r = 0.50 to 1.0      r = −0.50 to −1.0
The Pearson r is used to advance research beyond the arena of descriptive statistics. Specifically,
the Pearson r enables investigators to assess the nature of the association between two
variables, X and Y.
The Pearson r, a correlation coefficient, is a statistic that quantifies the extent to which two
variables X and Y are associated, and whether the direction of their association is positive,
negative, or zero.
A positive correlation is one where, as the value of X increases, the corresponding value of Y also
increases; similarly, a positive correlation exists when, as the value of X decreases, the value of Y
also decreases.
A negative correlation identifies an inverse relationship between variables X and Y: as the value
of one increases, the other decreases.
A zero correlation indicates that there is no pattern or predictive linear relationship between the
behavior of variables X and Y.
Each participant should have two measurements.
The number of participants should be greater than 30.
The distribution should be symmetric or normal.
Identify the type of correlation from the above scatter plot.
For example:

Absenteeism (X)   Academic Achievement (Y)   XY    X²    Y²
0                 8                          0     0     64
2                 10                         20    4     100
3                 4                          12    9     16
6                 6                          36    36    36
9                 1                          9     81    1
10                3                          30    100   9
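Computing r for the six pairs in this table from the raw-score formula reproduces the value of −.797 used in the interpretation:

```python
import math

# Pearson r from the raw-score formula:
# r = (n*SXY - SX*SY) / sqrt((n*SXX - SX^2) * (n*SYY - SY^2))
X = [0, 2, 3, 6, 9, 10]    # absenteeism
Y = [8, 10, 4, 6, 1, 3]    # academic achievement

n = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))   # 107
sxx = sum(x * x for x in X)              # 230
syy = sum(y * y for y in Y)              # 226

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 3))   # -0.797
```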
r = -.797
Correlations
                                    Absenteeism   GPA
Absenteeism   Pearson Correlation   1             −.797
              Sig. (2-tailed)                     0.00
              N                     300           300
GPA           Pearson Correlation   −.797         1
              Sig. (2-tailed)       0.00
              N                     300           300
There was a strong, negative correlation between the two variables, r = −.797, n = 300, p < .0005,
with high levels of absenteeism associated with lower levels of GPA. This implies that the
relationship is negative and significant: as absenteeism increases, GPA decreases.
Class work
Students 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Test1    6 6 5 4 7 4 4 3 6 10 6  6  4  8  12 12 11
Test2    8 4 8 2 4 8 2 5 10 10 10 8  7  12 11 10 9
Calculate the relationship between Test 1 and Test 2, check its significance and interpret it
Spearman's rho is used when:
- there are ranked data for variable A and variable B
- the data are skewed away from the normal distribution
- N is less than 30
The Pearson correlation coefficient is the dominant correlation index in psychological statistics.
There is another called Spearman’s rho which is not very different. Instead of taking the scores
directly from your data, the scores on a variable are ranked from smallest to largest. That is, the
smallest score on variable X is given rank 1, the second smallest score on variable X is given
rank 2, and so forth. The smallest score on variable Y is given rank 1, the second smallest score
on variable Y is given rank 2, etc. Then Spearman’s rho is calculated like the Pearson correlation
coefficient between the two sets of ranks as if the ranks were scores.
A special procedure is used to deal with tied ranks. Sometimes certain scores on a variable are
identical; there might be two or three people who scored 7 on variable X, for example. This
situation is described as tied scores or tied ranks. The question is what to do about them. The
conventional answer in psychological statistics is to pretend first of all that the tied scores can be
separated by fractional amounts. Then we allocate the appropriate ranks to these 'separated'
scores, but give each of the tied scores the average rank that they would have received if they
could have been separated.
The two scores of 5 are each given the rank 2.5 because if they were slightly different they
would have been given ranks 2 and 3, respectively. But they cannot be separated and so we
average the ranks as follows:
In the table below, there are two scores of 5; these tied scores are each given the average rank
(2 + 3)/2 = 2.5.
(7 + 8 + 9)/3 = 8
There are three scores of 9 which would have been allocated the ranks 7, 8 and 9 if the scores
had been slightly different from each other. These three ranks are averaged to give an average
rank of 8 which is entered as the rank for each of the three tied scores
Participants     1     2      3     4     5      6     7     8     9      10
Test1 for MA     8     3      9     7     2      3     9     8     6      7
Rank1            7.5   2.5    9.5   5.5   1      2.5   9.5   7.5   4      5.5
Test2 for MUA    2     6      4     5     7      7     2     3     5      4
Rank2            1.5   8      4.5   6.5   9.5    9.5   1.5   3     6.5    4.5
Difference (D)   6     5.5    5     1     8.5    7     8     4.5   2.5    1
D²               36    30.25  25    1     72.25  49    64    20.25 6.25   1
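The ranking with ties and the difference formula can be sketched as below. With ties present, the D² shortcut is only an approximation to computing Pearson's r on the ranks, but it matches the procedure shown in this table.

```python
# Spearman's rho via the difference formula: rho = 1 - 6*sum(D^2) / (n(n^2 - 1)),
# with tied scores receiving the average of the ranks they span
def ranks_with_ties(scores):
    ordered = sorted(scores)
    ranks = []
    for s in scores:
        # 1-based positions this score occupies in the ordered list
        positions = [i + 1 for i, v in enumerate(ordered) if v == s]
        ranks.append(sum(positions) / len(positions))   # average rank for ties
    return ranks

math_ability = [8, 3, 9, 7, 2, 3, 9, 8, 6, 7]      # Test1 for MA
musical_ability = [2, 6, 4, 5, 7, 7, 2, 3, 5, 4]   # Test2 for MUA

r1 = ranks_with_ties(math_ability)
r2 = ranks_with_ties(musical_ability)
n = len(r1)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))     # sum of squared rank differences
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rho, 2))   # 305.0 -0.85
```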
This finding implies that the Spearman correlation coefficient found evidence that musical ability
was significantly and inversely related to mathematical ability (r = −0.85, p < 0.05).
Classwork exercise
1. A researcher wants to investigate the relationship between hours of study and levels of
perceived stress. Data were collected from a sample of 10 students, as shown below.
2. Is there a relationship between students' study hours and their levels of perceived
stress?
3. Do people who study more hours have lower or higher levels of perceived
stress?
CHAPTER SIX
HYPOTHESIS TESTING
Many experiments are carried out with the deliberate object of testing hypotheses.
For example, consider the following statement:
A. "Students who receive counseling will show a greater increase in creativity than
students not receiving counseling."
Typically, in hypothesis testing, we have two options to choose from. These are termed the null
hypothesis and the alternative hypothesis.
6.4 Directional & Non-directional Hypotheses
Directional hypotheses
A directional hypothesis specifies the exact direction of the relationship between the two variables: as study hours increase, so will exam grades. This is also called a one-tailed hypothesis.
Non-directional hypotheses
In some studies we are not sure of the exact nature of the relationship. In making such a prediction we expect there to be a relationship, but we are not sure whether, for example, anxiety will increase or decrease memory. We therefore predict only that there is a relationship between the two variables, without specifying its exact nature. This is called a two-tailed hypothesis.
6.5 Errors
Type I error
Suppose we conducted some research and found that the probability of obtaining the effect we observed by chance is small.
In one study, the null hypothesis stated that there is no relationship between length of hair in males and the number of criminal offences committed.
But we have made a Type I error if we conclude that we have support for our prediction that there will be a relationship between length of hair in males and the number of criminal offences committed, when in fact the null hypothesis is true.
• A Type I error occurs when the sample data appear to show a treatment effect when, in
fact, there is none.
• In this case the researcher will reject the null hypothesis and falsely conclude that the
treatment has an effect.
• Type I errors are caused by unusual, unrepresentative samples. Just by chance the
researcher selects an extreme sample with the result that the sample falls in the critical
region even though the treatment has no effect.
• The hypothesis test is structured so that Type I errors are very unlikely; specifically, the
probability of a Type I error is equal to the alpha level.
Type II Errors
• A Type II error occurs when the sample data fail to show a treatment effect that actually exists.
• In this case, the researcher will fail to reject the null hypothesis and falsely conclude that
the treatment does not have an effect.
• Type II errors are commonly the result of a very small treatment effect. Although the
treatment does have an effect, it is not large enough to show up in the research study.
A big difference in mean scores between conditions may be due to the predicted effects
of the independent variable rather than to random variability. But there is always a specific
probability that the differences in scores are caused entirely by random variability. So there
can never be 100 per cent certainty that the scores in an experiment are due to the effects
of manipulating the independent variable.
Statistical tests calculate probabilities that results are significant. Statistical tables provide
probabilities that any differences in scores are due to random variability, as stated by the
null hypothesis. This means that the less probable it is that any differences are due to
random variability, the more justification there is for rejecting the null hypothesis. This is
the basis of all statistical tests. Statistical tables give the probability that scores in an
experiment occur on a random basis.
If the probability that the scores are random is very low, then you can reject the null
hypothesis that the differences are random. Instead you can accept the research
hypothesis that the experimental results are significant, that is, that they are not likely to
be random. Strictly speaking, the only conclusion from looking up probabilities in
statistical tables is that they justify rejecting the null hypothesis. But you will find that, if
the null hypothesis can be rejected, psychological researchers usually claim that the
results provide support for the predictions in the research hypothesis.
There is always a probabilistic component involved in the accept–reject decision in
testing hypotheses. The criterion that is used for accepting or rejecting a null hypothesis is
called the significance level or p-value. The p-value represents the probability of concluding
(incorrectly) that there is a difference in your samples when no true difference exists.
It is a statistic calculated by comparing the distribution of a given sample data and an
expected distribution (normal, F, t etc.) and is dependent upon the statistical test being
performed.
For example, if two samples are being compared in a t-test, a p-value of 0.05 means that
there is only a 5% chance of arriving at the calculated t-value if the samples were not
different (from the same population).
In other words, a p-value of 0.05 means there is only a 5% chance that you would be
wrong in concluding that the populations are different or 95% confident of making a right
decision. For social sciences research, a p-value of 0.05 is generally taken as standard.
In psychology (possibly because it is thought that nothing too terrible can happen as a
result of accepting a result as significant!) there is a convention to accept probabilities of
either 1 per cent or 5 per cent as grounds for rejecting the null hypothesis.
The way levels of significance are expressed is to state that the probability of a result
being due to random variability is less than 1 per cent or less than 5 per cent. That is why
in articles in psychological journals you will see statements that differences between
experimental conditions are ‘significant (p < 0.01)’ or ‘significant (p < 0.05)’. This means
that the probability (p) of a result occurring by chance is less than (expressed as <) 1
per cent (0.01) or 5 per cent (0.05).
Sometimes you will find other probabilities quoted, such as p < 0.02 or p < 0.001. These
represent probabilities of obtaining a random result 2 times in 100 and 1 time in 1000 (2
per cent and 0.1 per cent). These percentage probabilities give you grounds for rejecting
the null hypothesis that your results are due to the effects of random variability.
6.7 T-test
A t-test examines differences in the mean scores of a parametric dependent variable across two
groups or conditions (the independent variable). As we saw in Chapter 5, data are parametric if
they are represented by interval values and are reasonably normally distributed. The t-test
outcome is based on differences in mean scores between groups and conditions
The single- or one-sample t test is used to compare the observed mean of one sample with a
population mean. One-sample t tests are usually employed by researchers who want to determine
if some set of scores or observations deviates from some established pattern or standard.
Some situations where one sample t-test can be used are given below:
An economist wants to know if the per capita income of a particular region is the same as
the national average.
The Quality Control department wants to know if the mean dimensions of a particular
product have shifted significantly away from the original specifications.
Does the academic achievement of ECCE department students deviate significantly from
the average academic achievement of Woldia University students?
Students 1 2 3 4 5 6 7 8 9
Test 8 7 5 6 8 7 8 6 6
Population mean ( μ) = 5
H0: x̄ = μ (the sample mean is equal to the population mean; there is no difference between the sample
mean and the population mean)
H1: x̄ ≠ μ (the sample mean is different from the population mean)
t = (x̄ − μ) / (s / √n) = (6.77 − 5) / (1.092 / √9) = 1.77 / 0.364 = 4.86
The calculated t-value is 4.86 > the critical value 2.30 at 0.05 significance level. Then,
H0 is rejected.
*P < 0.05
This shows that there is a significant difference between the sample mean and the population
mean scores t (8) = 4.86, p < 0.05. This also implies that the sample mean score of stat test (M =
6.77) is significantly higher than the population mean score (M = 5) for students.
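As a check on the arithmetic, the one-sample t test above can be reproduced in plain Python. Note that exact (unrounded) arithmetic gives t ≈ 4.88; the 4.86 in the worked example reflects rounding of the intermediate values, and the conclusion is the same either way.

```python
import math

# One-sample t test of the nine stat-test scores against the
# population mean mu = 5, following the steps in the example above.
scores = [8, 7, 5, 6, 8, 7, 8, 6, 6]
mu = 5

n = len(scores)
mean = sum(scores) / n                                         # 6.78
s = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))  # sample SD
t = (mean - mu) / (s / math.sqrt(n))                           # 4.88 unrounded
print(round(mean, 2), round(t, 2))   # 6.78 4.88
```

Since 4.88 exceeds the critical value 2.30 for df = 8 at the 0.05 level, H0 is rejected, as in the text.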
Basic concepts
An independent t-test measures differences between two distinct groups. Those differences might
be directly manipulated (e.g. drug treatment group vs. placebo group), they may be naturally
occurring (e.g. male vs. female), or they might be beyond the control of the experimenter (e.g.
depressed people vs. healthy people). In an independent t-test, mean dependent variable scores are
compared between the two groups (the independent variable). For example, we could measure
differences in the amount of money spent on clothes between men and women.
The t test (unrelated) is based on comparing the means for the two groups doing each condition.
This is because there is no basis for comparing differences between related pairs of scores for
each participant. Because the t test (unrelated) is based on unrelated scores for two conditions,
which are independent of each other, another name for the t test (unrelated) is the independent t
test.
In many real-life situations, we cannot determine the exact value of the population mean. We are
only interested in comparing two populations using a random sample from each. Such
experiments, where we are interested in detecting differences between the means of two
independent groups, are called independent samples tests. Some situations where the independent
samples t-test can be used are given below:
An economist wants to compare the per capita income of two different regions.
A labor union wants to compare the productivity levels of workers for two different
groups.
An aspiring MBA student wants to compare the salaries offered to the graduates of two
business schools.
In all the above examples, the purpose is to compare between two independent groups in contrast
to determining if the mean of the group exceeds a specific value as in the case of one sample t-
tests.
Assumptions
Computing Independent Sample T- test
For example:
Male students    1   2   3   4   5   6   7   8   9  10
Scores (X1)      4   6   5   7   8   4   3   2   4   5   ΣX1 = 48
X1²             16  36  25  49  64  16   9   4  16  25   ΣX1² = 260
Female students 11  12  13  14  15  16  17  18  19  20
Scores (X2)      8   9   6   7   8  10   8   9   7  10   ΣX2 = 82
Step 1 State the null and alternative hypotheses
H0: mean1 = mean2 (the two sample means are the same)
H1: mean1 ≠ mean2 (the two means are different from each other)
Step 2 Specify the level of significance = 0.05
Step 7 Make a decision to reject or fail to reject H0
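The computational steps between Step 2 and Step 7 are not shown above; a plain-Python sketch using the standard pooled-variance formula fills the gap (the two-tailed critical value 2.10 for df = 18 at α = 0.05 is assumed, not taken from the text).

```python
import math

# Independent samples t test on the male and female scores in the table.
male   = [4, 6, 5, 7, 8, 4, 3, 2, 4, 5]
female = [8, 9, 6, 7, 8, 10, 8, 9, 7, 10]

def mean(xs):
    return sum(xs) / len(xs)

def sum_sq_dev(xs):
    """Sum of squared deviations about the sample mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

n1, n2 = len(male), len(female)
pooled_var = (sum_sq_dev(male) + sum_sq_dev(female)) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (mean(male) - mean(female)) / se    # negative: females score higher
print(round(t, 2))                      # -4.8; |t| > 2.10, so reject H0
```

With means of 4.8 (male) and 8.2 (female), |t| ≈ 4.80 exceeds the critical value, so the null hypothesis of equal means would be rejected at the 0.05 level.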
In the case of the independent samples test for the difference between means, we assume that the
observations in one sample are not dependent on those in the other. However, this assumption limits the
scope of analysis, as in many cases the study has to be done on the same set of elements (people,
objects, etc.) to control some of the sample-specific extraneous factors. Experiments where
the observations are made on the same sample at two different times are called dependent or
paired sample t-tests. Some situations where the dependent samples t-test can be used are given
below:
The HR manager wants to know if a particular training program had any impact
in increasing the motivation level of the employees.
The production manager wants to know if a new method of handling machines
helps in reducing the break down period.
An educationist wants to know if interactive teaching helps students learn more
as compared to one-way lecturing.
One can compare these cases with the previous ones to observe the difference. The subjects in all
these cases are the same, and observations are taken at two different times.
For example:
Student   1  2  3  4  5  6  7  8  9  10
Maths     4  3  3  3  4  5  4  3  5  4    Σ = 38
Civic     1  2  2  3  3  2  2  4  1  1    Σ = 21
d         3  1  1  0  1  3  2 -1  4  3    Σd = 17
d²        9  1  1  0  1  9  4  1 16  9    Σd² = 51
Step 1 State the null and alternative hypotheses
H0: mean1 = mean2 (no difference between the two means)
H1: mean1 ≠ mean2 (mean1 is different from mean2)
Step 2 Specify the level of significance = 0.05,
Step 4 Determine the critical value = from the table = 2.26 (0.05)
Step 5 Determine the rejection region – All the values > 2.26
Step 6 Find the test statistic
t = Σd / √[(nΣd² − (Σd)²) / (n − 1)] = 17 / √[(10 × 51 − 17²) / (10 − 1)] = 17 / √(221/9) = 17 / √24.56 = 17 / 4.96 = 3.43
The calculated t-value is 3.43 > the critical value 2.26 at 0.05 significance level. Then,
H0 is rejected.
*P < 0.05
This shows that there is a significant difference between the mathematics and civics mean scores, t
(9) = 3.43, p < 0.05. This also implies that the mean of mathematics (M = 3.8) is significantly
higher than the mean score of civic education (M = 2.1) for students.
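The direct-difference computation above can be verified in a few lines of Python:

```python
import math

# Paired (dependent) t test on the maths and civic scores, using the
# direct-difference formula from the worked example above.
maths = [4, 3, 3, 3, 4, 5, 4, 3, 5, 4]
civic = [1, 2, 2, 3, 3, 2, 2, 4, 1, 1]

d = [m - c for m, c in zip(maths, civic)]   # per-student differences
n = len(d)
sum_d = sum(d)                              # 17
sum_d2 = sum(x * x for x in d)              # 51
t = sum_d / math.sqrt((n * sum_d2 - sum_d ** 2) / (n - 1))
print(round(t, 2))                          # 3.43; > 2.26, so reject H0
```

This is algebraically the same as t = d̄ / (s_d / √n), so either form gives t = 3.43 with df = 9.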
The analysis of variance (ANOVA) currently enjoys the status of being probably the most used
statistical technique in psychological research, integrating with other analyses such as
regression, multivariate analysis of variance and analysis of covariance. Analysis of variance is closely
related to the t-test for comparing means in psychological research. The
popularity and usefulness of this technique can be attributed to two facts. First, the analysis of
variance, like t, deals with differences between sample means, but unlike t, it has no restriction
on the number of means. Instead of asking merely whether two means differ, we can ask whether
two, three, four, five, or k means differ. Second, the analysis of variance allows us to deal with
two or more independent variables simultaneously, asking not only about the individual effects
of each variable separately but also about the interacting effects of two or more variables
(Pagano, 2009).
Based on the number of independent variables included in the research, there are different
forms of the analysis of variance, such as one-way analysis, two-way, three-way and so on. On
the other hand, considering the design, the nature of the dependent variable and the hypothesis to
be tested, scholars categorize analysis of variance into between-participants designs,
repeated measures designs and mixed designs. A one-way analysis of variance uses
one independent variable having three or more levels with one dependent variable (Hiwett &
Crammer, 2011).
As a parametric test, analysis of variance is interested in testing the null hypothesis having one
continuously measured dependent variable with one or more categorical independent variables.
The independent variables are expected to have different levels that have organized scores
obtained from data gathering tools. Stating the null and alternative hypotheses in symbols and in
words and thereby calculating the F-ratio in accordance with the steps are important activities in
analysis of variance. If the F-ratio showed significant differences across the means, post hoc test
analysis can be done in order to know which mean is significantly different from the others. At
the same time, calculating the effect size of the independent variable on the dependent variable using
different statistical techniques such as omega and eta squared is still important (Dancey & Reidey,
2011).
Analysis of variance (ANOVA) is a method of testing the equality of three or more population
means by analyzing sample variances. The logic for preferring one-way analysis of
variance to the t-test is as follows:
Like the t test, analysis of variance deals with differences between sample means, but
unlike the t test, it has no restriction on the number of means. Instead, we can ask whether
two, three, four, five, or k means differ.
Analysis of variance allows us to deal with two or more independent variables
simultaneously, asking not only about the individual effects of each variable separately
but also about the interacting effects of two or more variables (Pagano, 2009).
According to Howell (2011), the assumptions that underlie the analysis of variance (ANOVA) using the F
statistic are organized below.
For reasons dealing with our final test of significance, we will make the assumption that scores
in each population should be normally distributed around the population mean. We made the
same assumption for t- test. Moreover, even substantial departures from normality may, under
certain conditions, have remarkably little influence on the final result.
In other words, the analysis of variance is robust with respect to violations of the
assumptions of normality and homogeneity of variance.
E. The different samples are from populations that are categorized in only one way
The samples are expected to come from one independent variable organized as levels; in other
words, the design involves only a single independent variable.
Analysis of variance (ANOVA), as the name suggests, analyses the different sources from which
variation in the scores arises.
Between-groups variance
ANOVA looks for differences between the means of the groups. When the means are very
different, we say that there is a greater degree of variation between the conditions. If there were
no differences between the means of the groups, then there would be no variation. This sort of
variation is called between-groups variation (Dancey&Reidey, 2011).
Treatment effects: When we perform an experiment, or study, we are looking to see that the
differences between means are big enough to be important to us, and that the differences
reflect our experimental manipulation. The differences that reflect the experimental
manipulation are called the treatment effects
Individual differences: Each participant is different, therefore participants will respond
differently, even when faced with the same task. Although we might allot participants
randomly to different conditions, sometimes we might find, say, that there are more
motivated participants in one condition, or they are more practiced at that particular task.
Experimental error: Most experiments are not perfect. Sometimes experimenters fail to
give all participants the same instructions; sometimes the conditions under which the tasks
are performed are different, for each condition. At other times, equipment used in the
experiment might fail, etc. Differences due to errors such as these contribute to the
variability.
Within-groups variance
Another source of variance is the differences or variation within a group. This can be thought of
as variation within the columns.
Within-groups variation arises from:
Individual differences: In each condition, even though participants have been given the
same task, they will still differ in scores. This is because participants differ among
themselves in abilities, knowledge, IQ, personality and so on. Each group, or condition, is
bound to show variability.
Experimental error: This has been explained above
Steps for test statistic in One-Way ANOVA
Step 1 State the null and alternative hypotheses
H0: μ1 = μ2 = μ3 (All population means are equal.)
Ha: At least one mean is different from the others.
Step 2 Specify the level of significance = 0.05, 0.01, 0.1
Example 1: A researcher wanted to test the effect of study skills support on the academic
achievement scores of students at Debre Markos University. He took 15 students who needed
study skills support and assigned them randomly into three groups: placebo, low support
and high support. The level of significance for this hypothesis test is 0.05. The data collected from
the students are presented in the following table.
n1 = 5, n2 = 5, n3 = 5
N = 15
Solution:
Step1: State the null and alternative hypotheses
H0: μ1 = μ2 = μ3 (All population means are equal)
Ha: μ1 ≠ μ2 ≠ μ3 (At least one mean is different from the others)
Step 2: Specify the level of significance
In the F distribution, the rejection region is all the values greater than 3.89. In other words, if F
calculated greater than 3.89 reject the null hypothesis because it is in the rejection region or, if F
calculated is less than 3.89 accept the null hypothesis.
Calculate between sum of squares (SSB)
SSB = (ΣX1)²/n1 + (ΣX2)²/n2 + (ΣX3)²/n3 − (ΣX)²/N
SSB = (20)²/5 + (40)²/5 + (65)²/5 − (125)²/15
SSB = (80 + 320 + 845) - 1041.667
SSB = 1245 - 1041.667 = 203.333
Calculate within sum of squares (SSW)
SSW = ΣX² − [(ΣX1)²/n1 + (ΣX2)²/n2 + (ΣX3)²/n3]
SSW = 1299 − [(20)²/5 + (40)²/5 + (65)²/5]
SSW = 1299 - (80 + 320 + 845)
SSW = 1299 - 1245 = 54
Calculate total sum of squares (SST)
SST = SSB + SSW = 203.333 + 54 = 257.333
Calculate between groups mean square (MSB)
MSB = SSB / DFB = 203.333 / 2 = 101.667
Calculate within groups mean square (MSW)
MSW = SSW / DFW = 54 / 12 = 4.5
Calculate the F-ratio
F = MSB / MSW = 101.667 / 4.5 = 22.59
Step 7 Make a decision to reject or fail to reject H0
F calculated = 22.59
Then, F calculated = 22.59 > F critical (2, 12) = 3.89, so reject the null hypothesis. The test
statistic falls in the rejection region; therefore, you should reject the null hypothesis.
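The raw scores in the example's table were not reproduced above, but the group totals can be recovered from the worked sums of squares ((ΣXi)²/5 = 80, 320 and 845 give ΣX1 = 20, ΣX2 = 40 and ΣX3 = 65). A sketch of the F computation from those summary values:

```python
# One-way ANOVA from summary statistics: group sums, n per group, and
# the overall sum of squared scores, recovered from the worked example.
group_sums = [20, 40, 65]     # placebo, low support, high support
n_per_group = 5
sum_x2 = 1299                 # sum of X^2 over all 15 scores

k = len(group_sums)
N = k * n_per_group
grand_sum = sum(group_sums)

between_part = sum(s ** 2 / n_per_group for s in group_sums)
ssb = between_part - grand_sum ** 2 / N   # between sum of squares
ssw = sum_x2 - between_part               # within sum of squares
msb = ssb / (k - 1)                       # df between = k - 1 = 2
msw = ssw / (N - k)                       # df within = N - k = 12
f = msb / msw
print(round(ssb, 3), round(ssw, 1), round(f, 2))   # 203.333 54.0 22.59
```

Since 22.59 > F critical (2, 12) = 3.89, the null hypothesis of equal means is rejected, matching the hand computation.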
3.3 Post hoc Analysis
Post hoc analysis is a set of multiple comparison techniques for making comparisons between two or
more group means subsequent to an analysis of variance. Since there is enough evidence at the
5% level of significance to conclude that the mean academic achievement scores of students
differ, which mean differs from the others can be determined through post hoc
analysis. Post hoc methods differ in their power and in how well they control Type I error. Some of
them are listed below.
Let’s use the post hoc technique of the Tukey test for the example given above.
When the Tukey test is used for post hoc analysis, we use the Q-distribution to find the critical
value. The multiple comparisons through the Tukey test then have four steps, done as follows.
Step 1: Compute Q-cal = (meani − meanj) / √(MSW/n) for each pair of means. For example, for
comparison 3, low study skills support with high study skills support (mean2 with mean3):
Q-cal = (mean3 − mean2) / √(MSW/n) = (13 − 8) / √(4.5/5) = 5 / 0.95 = 5.27*
Step 2: Find Q-critical from the Q-distribution by (r, df): Q (3, 12) = 3.77
Step 3: Compare each Q-cal with Q-critical.
For mean1 and mean2, Q-cal > Q-cri (4.22 > 3.77): reject the null hypothesis.
For mean1 and mean3, Q-cal > Q-cri (9.49 > 3.77): reject the null hypothesis.
For mean3 and mean2, Q-cal > Q-cri (5.27 > 3.77): reject the null hypothesis.
Step 4: Interpretation
There is enough evidence at the 5% level of significance to conclude that, across the study skills
support conditions, all means of students' academic achievement scores are significantly different
from each other.
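The three Tukey comparisons can be reproduced from the group means (ΣXi/5 gives 4, 8 and 13) and MSW = 4.5; exact arithmetic gives Q values of 4.22, 9.49 and 5.27 against the critical value 3.77 (the worked figures above reflect rounding).

```python
import math

# Tukey post hoc comparisons for the study skills example:
# Q = (mean_i - mean_j) / sqrt(MSW / n), compared with Q-critical = 3.77.
means = {"placebo": 4.0, "low": 8.0, "high": 13.0}   # group means (sum/5)
msw, n = 4.5, 5
se = math.sqrt(msw / n)

q_low_vs_placebo  = (means["low"] - means["placebo"]) / se    # 4.22
q_high_vs_placebo = (means["high"] - means["placebo"]) / se   # 9.49
q_high_vs_low     = (means["high"] - means["low"]) / se       # 5.27
for q in (q_low_vs_placebo, q_high_vs_placebo, q_high_vs_low):
    print(round(q, 2), q > 3.77)   # every comparison exceeds Q-critical
```

All three Q values exceed 3.77, so each pair of means differs significantly at the 0.05 level, as stated in the interpretation above.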
Introduction
Regression analysis is a statistical technique that is widely used for research. Regression analysis is used
to predict the behavior of the dependent variables, based on the set of independent variables. In regression
analysis, dependent variables can be metric or non-metric, and the independent variables can be metric,
categorical, or a combination of both. These days, researchers use regression
analysis in two manners: linear regression analysis and non-linear regression analysis. Linear
regression analysis is further divided into two types, simple linear regression analysis and multiple linear
regression analysis. In simple linear regression analysis, there is one dependent variable and one independent
variable. In multiple linear regression analysis, there is a dependent variable and many independent
variables. Non-linear regression analysis is also of two types, simple non-linear regression analysis and
multiple non-linear regression analysis. When there is a non-linear relationship between the dependent and
independent variables and there is one dependent and one independent variable, it is said to be simple non-
linear regression analysis. When there is a dependent variable and two or more independent
variables, it is said to be multiple non-linear regression.
Learning outcomes
Upon completing this topic, the students will be able to:
Describe basic concepts of regression
Appropriately use regression principles in different research fields
Apply regression models in research design
Perform regression analysis and interpret the results
Key Terms: Regression, Intercept, Slope, Curve fit, Polynomial, Best fit line
Linear regression is the most basic and commonly used predictive analysis. Regression estimates are used
to describe data and to explain the relationship between one dependent variable and one or more
independent variables.
At the center of the regression analysis is the task of fitting a single line through a scatter plot. The
simplest form with one dependent and one independent variable is defined by the formula y = a + b*x.
Sometimes the dependent variable is also called endogenous variable, prognostic variable or regressand.
The independent variables are also called exogenous variables, predictor variables or regressors.
However, linear regression analysis consists of more than just fitting a linear line through a cloud of
data points. It consists of three stages: (1) analyzing the correlation and directionality of the data, (2)
estimating the model, i.e., fitting the line, and (3) evaluating the validity and usefulness of the model.
1) Regression might be used to identify the strength of the effect that the independent variable(s) have on a
dependent variable. Typical questions are: what is the strength of the relationship between dose and effect,
sales and marketing spend, or age and income?
2) It can be used to forecast effects or impacts of changes. That is, regression analysis helps us
understand how much the dependent variable will change when we change one or more independent
variables. A typical question is: how much additional Y do I get for one additional unit of X?
3) Regression analysis predicts trends and future values. The regression analysis can be used to get point
estimates. Typical questions are: what will the price of gold be 6 months from now? What is the total
effort for task X?
Assumptions:
Simple linear regression is a measure of linear association that investigates straight-line relationships
between a continuous dependent variable and an independent variable. It is best explained through the
regression equation Y = α + βX.
β is the estimated coefficient of the strength and direction of the relationship between the
independent (IV) and dependent variable (DV).
α (Y intercept) is a fixed point that is considered a constant (how much Y can exist without X).
Standardized Regression Coefficient (β)
Estimated coefficient of the strength of the relationship between the IV and DV.
Expressed on a standardized scale where higher absolute values indicate stronger
relationships (the scale ranges from -1 to 1).
Parameter Estimate Choices
Raw regression estimates (b1)
Raw regression weights have the advantage of retaining the scale metric—which is
also their key disadvantage.
If the purpose of the regression analysis is forecasting, then raw parameter estimates
must be used. The researcher is interested only in prediction.
Standardized regression estimates (β1)
Standardized regression estimates have the advantage of a constant scale.
Standardized regression estimates should be used when the researcher is testing
explanatory hypotheses
3.3. Predictive Methods
With the exception of the mean and standard deviation, linear regression is possibly the most widely
used of statistical techniques. This is because many of the problems that we encounter in research settings
require that we quantitatively evaluate the relationship between two variables for predictive purposes.
By predictive, I mean that the values of one variable depend on the values of a second.
interested in calibrating an instrument such as a sprayer pump. We can easily measure the current or
voltage that the pump draws, but specifically want to know how much fluid it pumps at a given
operating level. Or we may want to empirically determine the production rate of a chemical product
given specified levels of reactants.
Linear regression, which is the natural extension of correlation analysis, provides a great starting
point toward these objectives.
Curve fit - This is perhaps the most general term for describing a predictive relationship between two
variables, because the "curve" that describes the two variables is of unspecified form.
Polynomial fit - A polynomial fit describes the relationship between two variables as a mathematical
series. Thus a first order polynomial fit (a linear regression) is defined as y = a + bx. A second order
(parabolic) fit is y = a + bx + cx^2, a third order (cubic) fit is y = a + bx + cx^2 + dx^3, and so on.
Best fit line - The equation that best describes the y or dependent variable as a function of the x or
independent variable.
Linear regression and least squares linear regression - This is the method of interest. The
objective of linear regression analysis is to find the line that minimizes the sum of squared deviations
of the dependent variable about the "best fit" line. Because the method is based on least squares, it is
said to be a BLUE method, a Best Linear Unbiased Estimator.
6.1.2. Defining the Regression Model
We've already stated that the general form of the generalized linear regression is: y= a + bx. The
coefficient "a" is a constant called the y-intercept of the regression. The coefficient "b" is called the
"slope" of the regression. It describes the amount of change in y that corresponds to a given change in
x.
Specifically, the slope is defined as the summed cross product of the deviations of x and y from their
respective means, divided by the sum of squares of the deviations of x from its mean:
b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
The second relationship above is useful if these quantities have to be calculated by hand. The standard error
values of the slope and intercept are mainly used to compute the 95% confidence intervals. If you
accept the assumptions of linear regression, there is a 95% chance that the 95% confidence interval of
the slope contains the true value of the slope, and that the 95% confidence interval for the intercept
contains the true value of the intercept.
It's interesting to note that the slope in the generalized case is equal to the linear correlation
coefficient scaled by the ratio of the standard deviations of y and x: b = r (sy / sx).
This explicitly defines the relationship between linear correlation analysis and linear regression.
Notice that in the case of standardized regression, where sy and sx = 1, b = r. From this definition,
it should be clear that the best fit line passes through the mean values for x and y.
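These relations can be illustrated with a small least-squares fit in Python; the data here are hypothetical illustration values, not from the text. The assertion at the end confirms that the slope equals r·(sy/sx) and that the intercept forces the line through (x̄, ȳ).

```python
import math

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # cross products
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

b = sxy / sxx           # slope
a = my - b * mx         # intercept: line passes through (x-bar, y-bar)
r = sxy / math.sqrt(sxx * syy)

# slope = r * (sy / sx); the (n - 1) factors cancel in the ratio
assert abs(b - r * math.sqrt(syy / sxx)) < 1e-12
print(round(b, 3), round(a, 2), round(r, 3))
```

For these values the fit is nearly perfect (r ≈ 0.999), so the standardized slope is almost identical to the correlation coefficient, as the text notes.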
Assumptions
There are several assumptions that must be met for the linear regression to be valid:
The scatter of the y values about the y estimates (denoted yhat) based on the best fit line is often referred
to as the "standard error of the regression": s = √[ Σ(y − ŷ)² / (n − 2) ].
Notice that two degrees of freedom are lost in the denominator: one for the slope and one for the
intercept. A more descriptive definition - and strictly correct name - for this statistic is the root mean
square error (denoted RMS or RMSE).
How much variance is explained?
Just as in linear correlation analysis, we can explicitly calculate the variance explained by the regression
model: R² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)².
As with the other statistics that we have studied, the slope and intercept are sample statistics based on data
that include some random error, e: y = a + bx + e. We are of course actually interested in the true
population parameters, which are defined without error: y = α + βx. How do we assess the significance
level of the model? In essence we want to test the null hypothesis that b = 0 against one of three possible
alternative hypotheses: b > 0, b < 0, or b ≠ 0.
There are at least two ways to determine the significance level of the linear model. Perhaps the easiest
method is to calculate r, and then determine significance based on the value of r and the degrees of
freedom using a table for significance of the linear or product moment correlation coefficient. This
method is particularly useful in the standardized regression case when b=r.
The significance level of b, can also be determined by calculating a confidence interval for the slope. Just
as we did in earlier hypothesis testing examples, we determine a critical t-value based on the correct
number of degrees of freedom and the desired level of significance. It is for this reason that the random
variables x and y must be bivariate normal.
For the linear regression model the appropriate degrees of freedom is always df = n − 2. The level of
significance of the regression model is determined by the user; the 95% or 99% levels are generally
used.
The confidence interval is then defined as the product of the critical t-value and Sb, the standard error
of the slope: CI = b ± t-crit × Sb, where Sb is defined as Sb = s / √[ Σ(x − x̄)² ], with s the standard
error of the regression defined above.
Interpretation.
If there is a significant slope, then b will be statistically different from zero. So if b is greater than (t-
crit)*Sb, the confidence interval does not include zero. We would thus reject the null hypothesis that b = 0
at the pre-determined significance level. As (t-crit)*Sb becomes smaller, the greater our certainty in beta,
and the more accurate the prediction of the model.
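A sketch of this significance test in Python, reusing the hypothetical data from the earlier regression illustration; Sb = s / √Σ(x − x̄)² and t-crit = 2.776 (the two-tailed 0.05 critical value for df = n − 2 = 4) are the assumed ingredients.

```python
import math

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx

# standard error of the regression (RMSE with n - 2 df), then SE of slope
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
sb = s / math.sqrt(sxx)

t_crit = 2.776                      # two-tailed 0.05 critical t, df = 4
lo, hi = b - t_crit * sb, b + t_crit * sb
print(round(lo, 3), round(hi, 3))   # CI excludes zero -> reject H0: b = 0
```

Because the lower bound of the interval is well above zero, the slope is significantly different from zero at the 0.05 level for these data.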
If we plot the confidence interval on the slope, then positive and negative limits of the confidence interval
of the slope plot as lines that intersect at the point defined by the mean x,y pair for the data set. In effect,
this tends to underestimate the error associated with the regression equation, because it neglects the role of
the intercept in controlling the position of the line in the cartesian plane defined by the data. Fortunately,
we can take this into account by calculating a confidence interval on the line.
Just as we did in the case for the confidence interval on the slope, we can write this out explicitly as a
confidence interval for the regression line, that is defined as follows:
The degrees of freedom is still df = n − 2, but now the standard error of the regression line is defined as:
S_line = s √[ 1/n + (x − x̄)² / Σ(x − x̄)² ]