An Introduction and Overview
Statistics are numerical representations of our data. They can be:
Descriptive statistics summarize data.
Inferential statistics are tools that indicate how
much confidence we can have when we generalize
from a sample to a population.
Statistics depend on our sampling methods:
Probability or Non-probability? (i.e. Random or
not?)
Even with probability samples, there is a
possibility that the statistics we obtain do not
accurately reflect the population.
Sampling Error
Inadequate sampling frame, low response rate,
coverage (some people in population not given a
chance of selection)
Non-Sampling Error
Problems with transcribing and coding data;
observer/instrument error; misrepresentation.
Levels of Measurement – the relationship
among the values that are assigned to a
variable and the attributes of that variable.
Nominal- naming
Ordinal- rank order (high to low but no
indication of how much higher or lower one
subject is to another)
Interval- equal intervals between values
Ratio- equal intervals AND an absolute zero
(e.g. a ruler)
Examples:
Age: under 30, 30-39, 40-49, 50-59 (ordinal)
Gender: Male, Female (nominal)
Level of Agreement: Strongly Agree, Agree,
Neutral, Disagree, Strongly Disagree (ordinal)
Percentage of the library budget spent on staff
salaries (ratio)
Descriptive objectives/questions: descriptive statistics
Comparative research objectives/hypotheses: inferential statistics
Can be applied to any measurements
(quantitative or qualitative)
Offers a summary/ overview/ description of
data. Does not explain or interpret.
Number
Frequency count
Percentage
Deciles and quartiles
Measures of central tendency (mean, median, mode)
Averages
Variability
Variance and standard deviation
Graphs
Normal curve
Mode: most frequently occurring value in a
distribution (any scale, most unstable)
Median: midpoint in the distribution below which
half of the cases reside (ordinal and above)
Mean: arithmetic average- the sum of all values in a
distribution divided by the number of cases (interval
or ratio)
Example (11 test scores)
61, 61, 72, 77, 80, 81, 82, 85, 89, 90, 92
The median is 81 (half of the scores fall above 81,
and half below)
Example (6 scores)
3, 3, 7, 10, 12, 15
With an even number of scores, the median is
half-way between the two middle scores
Sum the middle scores (7+10=17) and divide by 2
17/2= 8.5
The median is insensitive to extremes: adding an
outlier (3, 3, 7, 10, 12, 15, 200) moves the
median only from 8.5 to 10
Mean is the sum of a set of values divided by the
number of values:
Scores: 5, 6, 7, 10, 12, 15
Sum: 55
Number of scores: 6
Computation of Mean: 55/6= 9.17
Mode is the most frequently occurring value in
a set.
Best used for nominal data.
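The three averages above can be sketched with Python's standard library, using the test-score examples from this section:

```python
# Mode, median, and mean via Python's statistics module.
from statistics import mean, median, mode

scores = [61, 61, 72, 77, 80, 81, 82, 85, 89, 90, 92]  # 11 test scores
print(mode(scores))    # 61 occurs twice, more than any other score
print(median(scores))  # 81, the 6th of the 11 sorted scores
print(round(mean(scores), 2))

# Even number of scores: the median is halfway between the two middle scores.
even = [3, 3, 7, 10, 12, 15]
print(median(even))    # (7 + 10) / 2 = 8.5
```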
Skewed to the right (positive) or left (negative)
An extremely hard test that results in a lot of
low grades will be skewed to the right:
the mode is smaller than the median, which is
smaller than the mean. This relationship exists
because the mode is the point on the x-axis
under the highest point of the curve, that is,
the score with the greatest frequency. The
median is the point on the x-axis that cuts the
distribution in half, such that 50% of the area
falls on each side.
An extremely easy test will result in a lot of
high grades, and will skew to the left (negative)
The order of the measures of central tendency
would be the opposite of the positively skewed
distribution, with the mean being smaller than
the median, which is smaller than the mode.
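The ordering of the three measures in a skewed distribution can be checked numerically; the grades below are hypothetical, chosen to be positively skewed:

```python
# For a right-skewed (positively skewed) set of scores,
# mode < median < mean.
from statistics import mean, median, mode

hard_test = [1, 1, 1, 2, 2, 3, 10]  # many low scores, one high outlier
print(mode(hard_test))    # 1
print(median(hard_test))  # 2
print(mean(hard_test))    # 20 / 7, pulled upward by the outlier
assert mode(hard_test) < median(hard_test) < mean(hard_test)
```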
Variability is the differences among scores-
it shows how subjects vary:
Dispersion: extent of scatter around the “average”
Range: the difference between the highest and lowest scores in a distribution
Variance and standard deviation: spread of scores in
a distribution. The greater the scatter, the larger the
variance
Interval or ratio level data
Standard deviation: how much subjects differ
from the mean of their group
The more spread out the subjects are around
the mean, the larger the standard deviation
Sensitive to extremes or “outliers”
Allows for comparisons across variables
e.g. is there a relation between one's occupation and
one's reason for using the public library?
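The standard-deviation points above can be sketched with Python's statistics module; the two score sets are hypothetical:

```python
# Sample variance and standard deviation; a more scattered
# group yields a larger standard deviation.
from statistics import stdev, variance

scores = [5, 6, 7, 10, 12, 15]
print(variance(scores))          # sum of squared deviations / (n - 1)
print(round(stdev(scores), 2))   # square root of the variance

spread_out = [1, 2, 7, 10, 15, 20]  # same mean, more scatter
assert stdev(spread_out) > stdev(scores)
```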
Hypothesis Testing
The level of significance is the predetermined
level at which a null hypothesis is not
supported. The most common level is p < .05
p = probability
< = less than (> = greater than)
Type I error: rejecting the null hypothesis when it is really true
Type II error: failing to reject the null hypothesis when it is really false
By using inferential statistics to make decisions,
we can report the probability that we have
made a Type I error (indicated by the p value
we report)
By reporting the p value, we alert readers to
the odds that we were incorrect when we
decided to reject the null hypothesis
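The logic of comparing a p value to the .05 level can be illustrated with a simple two-tailed one-sample z-test (a sketch only; the sample scores and the population mean and standard deviation below are hypothetical, and a z-test assumes those population values are known):

```python
# A minimal significance test using only the standard library.
from math import sqrt
from statistics import NormalDist, mean

sample = [82, 85, 88, 90, 91, 84, 87, 89]  # hypothetical test scores
pop_mean, pop_sd = 80, 10                  # assumed population values

n = len(sample)
z = (mean(sample) - pop_mean) / (pop_sd / sqrt(n))
# Two-tailed p value: probability of a z at least this extreme by chance.
p = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 2), round(p, 4))
if p < .05:
    print("Reject the null hypothesis; p is the risk of a Type I error")
```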
Chi-square test of independence: two variables
(nominal and nominal, nominal and ordinal, or
ordinal and ordinal)
Affected by number of cells, number of cases
2-tailed test = non-directional hypothesis
1-tailed test = directional hypothesis
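A chi-square statistic for two nominal variables can be computed directly from its definition; the 2x2 table of counts below is hypothetical (e.g. occupation category vs. reason for library use):

```python
# Chi-square test of independence for a 2x2 table of observed counts.
rows = [[30, 10],
        [20, 40]]

row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
total = sum(row_totals)

# Expected count for each cell = row total * column total / grand total.
chi2 = sum((rows[i][j] - row_totals[i] * col_totals[j] / total) ** 2
           / (row_totals[i] * col_totals[j] / total)
           for i in range(2) for j in range(2))
print(round(chi2, 2))
# For a 2x2 table (1 degree of freedom) the .05 critical value is 3.84;
# a larger chi-square suggests the variables are not independent.
```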
Correlation—the extent to which two variables
are related across a group of subjects
Pearson r
It can range from -1.00 to 1.00
-1.00 is a perfect inverse relationship—the strongest possible
inverse relationship
0.00 indicates the complete absence of a relationship
1.00 is a perfect positive relationship—the strongest possible
direct relationship
The closer a value is to 0.00, the weaker the relationship
The closer a value is to -1.00 or +1.00, the stronger it is
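Pearson's r can be computed from its definition; the paired values below are hypothetical (e.g. hours of instruction vs. quiz score):

```python
# Pearson's r: covariance divided by the product of the
# standard deviations (here via sums of squared deviations).
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / sqrt(sum((a - mx) ** 2 for a in x)
               * sum((b - my) ** 2 for b in y))
print(round(r, 3))  # always between -1.00 and 1.00
```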
Spearman rho
t-test
Tests the difference between two sample means
for significance
pretest to posttest
Relates to research design
Perhaps used for information literacy instruction
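The pretest/posttest case can be sketched as a paired t statistic; the quiz scores below are hypothetical information literacy results:

```python
# Paired (pretest to posttest) t statistic: mean difference
# divided by the standard error of the differences.
from math import sqrt
from statistics import mean, stdev

pretest  = [55, 60, 62, 70, 58, 65]
posttest = [62, 66, 61, 78, 64, 70]

diffs = [post - pre for pre, post in zip(pretest, posttest)]
n = len(diffs)
t = mean(diffs) / (stdev(diffs) / sqrt(n))
print(round(t, 2))
# Compare |t| to a critical value for n - 1 degrees of freedom
# (or get a p value from a t distribution, e.g. via scipy.stats).
```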
Analysis of variance
Regression analysis (including step-wise
regression)
Analysis of variance (ANOVA) tests the
difference(s) among two or more means
It can be used to test the difference between
two means
So use t-test or ANOVA?
KEY: ANOVA can also be used to test the
difference among more than two means in a
single test, which cannot be done with a t-test
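The one-way ANOVA F statistic can be computed from its definition for three groups at once; the groups below are hypothetical:

```python
# One-way ANOVA: F = between-group variance / within-group variance.
from statistics import mean

groups = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]  # three hypothetical groups
grand = mean(x for g in groups for x in g)
k = len(groups)                   # number of groups
n = sum(len(g) for g in groups)   # total number of cases

ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
f = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f, 2))  # compare to an F critical value for (k-1, n-k) df
```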
Parametric statistical tests generally require
interval or ratio level data and assume that the
scores were drawn from a normally distributed
population or that both sets of scores were
drawn from populations with the same
variance or spread of scores
Nonparametric methods do not make
assumptions about the shape of the population
distribution. These are typically less powerful
and often need large samples