Kenya Medical Training College
Quantitative Software Lab
Dr Peter Samuels
Birmingham City University
27th January 2024
Training summary
❑ Introductory teaching
❑ Briefing on training
❑ Use instructions and videos to follow
through the activities (with blended
learning support)
❑ Final plenary discussion
Content Summary
❑ Part A. Brief introduction to inferential
statistics
❑ Part B. Correlation and regression:
➢ Excel analysis
➢ SPSS analysis
❑ Part C. Chi-squared test:
➢ Excel analysis
➢ SPSS analysis
❑ Exercises
Part A. The data analysis
process
[Flowchart: Descriptive statistics → Test selection → Assumption checking. Pass → Parametric testing; Fail → Nonparametric testing]
What is parametric statistics?
❑ A form of statistical testing (inference) which assumes
data comes from a distribution (defined by parameters)
❑ Often this is the normal distribution, whose parameters
are a mean and a standard deviation
❑ We usually need to test for this first
❑ Normally only use with scale data
❑ If a parametric test can’t be used then the nonparametric
equivalent should be used (where available)
[Figure: Normal distribution with mean 0 and standard deviation 1. A significant event is one lying in one of the tails of this distribution.]
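As a rough illustration of what "lying in one of the tails" means, the sketch below computes the probability that a value from a standard normal distribution (mean 0, standard deviation 1) falls at least as far from 0 as a given value z. The function name is hypothetical, and the z values are chosen only for illustration.

```python
import math


def two_tailed_probability(z):
    """Probability that a standard normal value lies at least |z| from 0."""
    # For Z ~ N(0, 1): P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


# Illustrative z values only: the further into the tails, the smaller the probability
for z in (0.5, 1.96, 3.29):
    print(f"z = {z}: two-tailed probability = {two_tailed_probability(z):.4f}")
```

A value of about 1.96 standard deviations from the mean corresponds to a two-tailed probability of roughly 0.05.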
Nonparametric statistics
❑ A type of statistical inference which does not
make any assumptions about the data coming
from a distribution
❑ Often applies to category-based data (nominal
and ordinal) but can also apply to scale-based
data if test assumptions are not met
❑ Advantage: no need to check assumptions
❑ Disadvantages:
➢ Results are generally less sensitive (higher sig-
values)
➢ Cannot handle more complex data structures (such
as 2-way ANOVA)
Interpreting scatter plots
Data values meet the assumptions for a linear parametric analysis
provided that the dispersion is approximately ‘cigar shaped’
(ellipse with a rotated axis)
Some things which can go
wrong with these assumptions
[Figure panels: Non-linear; Bunched in one corner; Pear-shaped; No relationship]
Correlation and regression
assumptions: summary
❑ Linear relationship needed between the two
variables
❑ Variation in the data should be unrelated to the
value of the independent variable (i.e. not ‘pear’
shaped)
❑ Data should not be bunched up at one end for
either variable
❑ Watch out for influential outliers
❑ Parametric tests are more sensitive to violations
with larger data sets – smaller data sets are more
forgiving
Recap: Statistical decision
making is like a courtroom
❑ You are the judge
❑ Your data is on trial
❑ The assumption of
innocence is called
the null hypothesis
(often written H0)
❑ Only reject the
assumption of
innocence (i.e. convict
your data of guilt) if
there is evidence beyond reasonable doubt (sufficient
evidence to reject the null hypothesis)
The null and alternative
hypotheses
❑ Statistical testing is about making a decision about the
significance of a data feature or summary statistic.
❑ We usually assume that this was just a random event
then seek to measure how unlikely such an event was.
❑ The statement of this position is known as the null
hypothesis and is written H0.
❑ In statistical testing we make a decision about whether
to retain or reject the null hypothesis based on the
probability (or ‘P-’) value of the test statistic.
❑ The logical opposite of the null hypothesis is known as
the alternative hypothesis.
Standard significance levels
[Images: Mo Farah; Johnny Depp (Amber Heard); Greta Thunberg: Climate change; Harvey Weinstein; Albert Einstein: Gravitational waves]
Standard significance levels
and the null hypothesis (H0)
P-value of test statistic   Significant?              Formal action   Informal interpretation             Example
> 0.1                       No                        Retain H0       No evidence to reject H0            Mo Farah
0.05 to 0.1                 No                        Retain H0       Weak evidence to reject H0          Johnny Depp
0.01 to 0.05                Yes, at the 0.05 level    Reject H0       Evidence to reject H0               Climate change
0.001 to 0.01               Yes, at the 0.01 level    Reject H0       Strong evidence to reject H0        Harvey Weinstein
< 0.001                     Yes, at the 0.001 level   Reject H0       Very strong evidence to reject H0   Gravitational waves
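The table above can be read as a simple lookup. The sketch below is one way to express it in code. The function name is hypothetical, and because the table does not say which band an exact boundary value (such as 0.05) belongs to, placing boundary values in the higher band is an assumption.

```python
def interpret_p_value(p):
    """Return (formal action, informal interpretation) for a P-value."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a P-value must lie between 0 and 1")
    # Assumption: an exact boundary value falls into the higher (weaker) band
    if p < 0.001:
        return "Reject H0", "Very strong evidence to reject H0"
    if p < 0.01:
        return "Reject H0", "Strong evidence to reject H0"
    if p < 0.05:
        return "Reject H0", "Evidence to reject H0"
    if p < 0.1:
        return "Retain H0", "Weak evidence to reject H0"
    return "Retain H0", "No evidence to reject H0"


for p in (0.2, 0.07, 0.032, 0.004, 0.0002):
    action, meaning = interpret_p_value(p)
    print(f"P = {p}: {action} ({meaning})")
```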
Null and alternative
hypotheses
Example               Null hypothesis                                                          Alternative hypothesis
Mo Farah              He is not a drugs cheat                                                  He is a drugs cheat
Johnny Depp           He is not a wife beater                                                  He is a wife beater
Climate change        Human activity is not the major recent factor affecting the climate     Human activity is the major recent factor affecting the climate
Harvey Weinstein      He did not sexually abuse actresses                                      He did sexually abuse actresses
Gravitational waves   They don’t exist                                                         They do exist
Simple test selection overview
[Flowchart: Type of analysis]
❑ Association of two variables
➢ Parametric: Pearson correlation and simple linear regression
➢ Nonparametric/ordinal: Spearman correlation
➢ Few values, non-monotonic association, or nominal: Chi-squared and Fisher’s exact
❑ Comparison of two groups: different groups
➢ Few values: Chi-squared and Fisher’s exact
➢ Parametric: Independent t-test
➢ Nonparametric/ordinal: Mann-Whitney U test
❑ Comparison of two groups: repeated measures
➢ Nonparametric/ordinal: Wilcoxon test
➢ Parametric: Paired t-test
Pearson’s correlation and
linear regression
Pearson’s correlation:
❑ Parametric test for an association between two
variables
❑ Provides a correlation coefficient (r) between -1 and
+1 and a probability value
Linear regression:
❑ Find the best (least squares) fit for a linear
approximation to the relationship between two
variables (with the same probability value for the
slope of the fitted line as the correlation analysis)
Correlation coefficient: examples
❑ The sign of the correlation coefficient depends on the direction
of the fitted line (not its slope)
❑ The size of the correlation coefficient depends on how closely the
data points are aligned (all on a line means the coefficient is +1 or -1)
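A small sketch of these two points, using made-up data rather than anything from the slides: points lying exactly on a rising line give r = +1 whether the line is steep or shallow, points on a falling line give r = -1, and scattered points give a value in between. The helper name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up illustration data
xs = [1, 2, 3, 4, 5]
steep_rising = [10 * x for x in xs]        # exactly on a steep rising line
shallow_rising = [0.1 * x + 3 for x in xs]  # exactly on a shallow rising line
falling = [20 - 2 * x for x in xs]          # exactly on a falling line
scattered = [2, 1, 4, 3, 6]                 # rising overall, but not on a line

for name, ys in [("steep rising", steep_rising), ("shallow rising", shallow_rising),
                 ("falling", falling), ("scattered", scattered)]:
    print(f"{name}: r = {pearson_r(xs, ys):.3f}")
```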
Interpretation of r
Absolute value of r Interpretation
0 < |r| ≤ 0.1 Very small correlation
0.1 < |r| ≤ 0.3 Small correlation
0.3 < |r| ≤ 0.5 Medium correlation
0.5 < |r| ≤ 1 Large correlation
However, the interpretation of r also depends on the
context (e.g. physical science v. social science)
Reference:
Cohen, J., Cohen, P., West, S. G. and Aiken, L. S. (2003)
Applied Multiple Regression/Correlation Analysis for the
Behavioral Sciences. 3rd edn. Mahwah, NJ: Lawrence
Erlbaum.
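The interpretation table above can be sketched as a lookup on the absolute value of r. The function name is hypothetical, and the label returned for r = 0 exactly is an assumption, since the table starts above 0.

```python
def interpret_r(r):
    """Label the size of a correlation coefficient using the table's bands."""
    size = abs(r)
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    if size == 0:
        return "No correlation"  # assumption: the table does not cover r = 0
    if size <= 0.1:
        return "Very small correlation"
    if size <= 0.3:
        return "Small correlation"
    if size <= 0.5:
        return "Medium correlation"
    return "Large correlation"


for r in (0.05, -0.25, 0.4, 0.515, -0.9):
    print(f"r = {r}: {interpret_r(r)}")
```

The sign of r is ignored here because it describes direction, not size.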
Example: Beta endorphin
levels before and after a race
❑ There seems to be a weak positive linear relationship between
these two data series, but is it significant?
❑ Data appears more spread for lower values of Before
❑ It is possible to use Pearson correlation
because of the small sample size
Pearson correlation analysis
❑ Null hypothesis: there is no correlation between Before and After
❑ Correlation coefficient is 0.515
❑ Not significant at the 0.05 level
❑ Interpretation: No evidence of a linear correlation (although very
near to being weak evidence, in which case the value of r could be
interpreted as ‘large’)
[Figure: Output from SPSS]
Linear regression analysis
❑ Gives a constant value of 17.282 and a slope value
of 1.173
❑ However, the slope is not significant at 0.05 (same
sig. value as the correlation coefficient), so there is
no evidence of a linear relationship
Fitted regression line
y = 17.282 + 1.173x
❑ Shown for illustration purposes only
❑ More data points might have given us a significant result
Chi-square test image: dropping
pebbles into a grid with known
row and column totals
How likely
or unlikely is
the pattern
obtained?
Example: Staff absences by
gender
❑ This is a typical pattern: Retain H0
AbsCat   Gender: Female   Gender: Male   Total
Low      36               8              44
High     21               6              27
Total    57               14             71
❑ But this is an unlikely pattern: Reject H0
AbsCat   Gender: Female   Gender: Male   Total
Low      30               14             44
High     27               0              27
Total    57               14             71
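The pebble-dropping image can be imitated with a small simulation. The sketch below repeatedly assigns the 14 males at random among the 71 staff (keeping the row and column totals fixed) and records how often the random table is at least as unusual as the observed one, measured by the chi-squared statistic. This is a Monte Carlo illustration under assumptions (5000 trials, a fixed random seed, hypothetical function names), not the chi-squared test itself.

```python
import random


def chi_squared(table):
    """Chi-squared statistic for a table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def simulated_p_value(table, trials=5000, seed=1):
    """Share of random 2x2 tables with the same totals that are at least as unusual."""
    rng = random.Random(seed)
    (a, b), (c, d) = table  # rows: Low, High; columns: Female, Male
    low_total, male_total, total = a + b, b + d, a + b + c + d
    high_total = total - low_total
    observed_stat = chi_squared(table)
    hits = 0
    for _ in range(trials):
        # Pick which staff are male; positions below low_total count as Low
        males = rng.sample(range(total), male_total)
        male_low = sum(1 for m in males if m < low_total)
        male_high = male_total - male_low
        random_table = [[low_total - male_low, male_low],
                        [high_total - male_high, male_high]]
        if chi_squared(random_table) >= observed_stat - 1e-9:
            hits += 1
    return hits / trials


typical = [[36, 8], [21, 6]]
unlikely = [[30, 14], [27, 0]]
print("Typical pattern: ", simulated_p_value(typical))
print("Unlikely pattern:", simulated_p_value(unlikely))
```

A large share for the first table and a very small share for the second matches the Retain H0 and Reject H0 decisions shown on the slide.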
Chi-squared test and Fisher’s
exact test
❑ Nonparametric tests of association between two
categorical variables:
➢ A weaker relationship than correlation
➢ Scale data can be turned into intervals if necessary
❑ Both tests compare observed frequency counts
against expected counts calculated from row and
column totals
❑ The Chi-squared test has certain restrictions for it to
be valid
❑ The Fisher’s exact test only works with 2×2 tables.
It complements the Chi-squared result. It can also
be used with smaller data sets where the Chi-
squared test is not valid.
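For readers curious about what Fisher’s exact test calculates, here is a minimal sketch for a 2×2 table. It uses one common two-sided convention (summing the probabilities of all tables with the same totals that are no more likely than the observed table). The function name is hypothetical, and the example tables are the two staff absence patterns shown earlier in these slides.

```python
import math


def fisher_exact_2x2(table):
    """Two-sided Fisher's exact test P-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def probability(x):
        # Hypergeometric probability of x in the top-left cell, totals fixed
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = probability(a)
    # Sum every possible table that is no more likely than the observed one
    # (the small tolerance guards against floating-point ties)
    return sum(p for p in (probability(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-7))


print(fisher_exact_2x2([[36, 8], [21, 6]]))   # typical pattern
print(fisher_exact_2x2([[30, 14], [27, 0]]))  # unlikely pattern
```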
Example: Favourite food type
and gender
There appears to be an association between these
variables, but is it significant?
Result of test
The test is valid because the percentage in note a. (cells with an
expected count below 5) is <20% and the minimum expected count is >1
❑ Sig. value for the Pearson Chi-Square is 0.000, meaning
< 0.001
❑ Reject the null hypothesis at the 0.001 level
❑ Very strong evidence that favourite food type is
associated with gender
Part B. How the lab works
❑ Log on to a computer
❑ Open today’s Google Drive folder in an Internet
browser
❑ Download the Excel or SPSS data files as instructed
❑ Open the slides file on one side of the screen (or use
a separate device to read the instructions, or print
them out) or open the videos
❑ Follow the instructions or watch and copy the videos
❑ Open Excel or SPSS on the other side of the screen
or on your computer
❑ Compare your output and charts with those in the
slides or the examples provided in the Google
Drive folder
Correlation and regression
Example 1: life insurance and income
A life insurance company wishes to examine the relationship
between the amount of life insurance held by a family and the
family income. A random sample of 20 households was taken and
the values for the amount of life insurance held and annual family
income (in £000s) were recorded.
Household   Income   Insurance
1           25       90
2           40       165
3           60       220
4           30       145
5           29       114
6           41       175
7           37       145
8           46       192
9           105      395
10          81       339
11          57       230
12          72       262
13          140      570
14          23       100
15          55       210
16          58       243
17          87       335
18          72       299
19          80       305
20          48       205
Research question
Is there a linear association between
household income and household insurance
level?
Excel data analysis:
summary
❑ Download the CorrelationExample1 Excel file
from today’s Google Drive folder
❑ Copy the data onto a new sheet
❑ Create and format a scatter plot
❑ Make the data analysis button visible
❑ Run a linear regression analysis
❑ Format and interpret the output
❑ Add a trend line to the scatter plot
❑ Open today’s Google Drive folder
❑ Download the file CorrelationExample1 and save it
onto your computer (e.g. Desktop) then open it
The first goal is to create a scatter plot which looks
like this:
❑ This file contains the data for
this example in 3 columns
❑ Click on this button to create
a new sheet (automatically
called Sheet1)
❑ Double click on its name and
change it to “Excel Analysis”
❑ Now go back to the first
sheet and click here
❑ This selects the entire
sheet
❑ Select copy
❑ Go to the second sheet
❑ Press the Return key on
the keyboard
❑ This pastes the entire
sheet onto the new sheet
(along with its formatting)
❑ Select this data
❑ Click on the Insert tab
❑ Select this button and choose the first option
The sheet should now look like this:
Click and drag the chart nearer to the data
The next goal is to make the chart look like this:
❑ Click on the Home tab
❑ Change the font to Arial, the font colour to black
and the font size to 11
❑ Select the
Design tab
❑ Select Add
Chart
Element >
Chart Title >
None
❑ This
removes the
chart title
❑ Select Add Chart Element > Axis Title > Horizontal
❑ Enter “Income (£000s)”
❑ Go to the Home tab and select Bold
❑ The chart should now look like this:
❑ Repeat with the primary vertical axis title
❑ Change it to “Insurance (£000s)” and make it
bold
❑ The chart should now look like this:
❑ Double click on the background of the chart
❑ This opens the Format Plot Area dialog box
❑ Click on the Fill option and select Solid fill
❑ Select the Color menu and More colors…
❑ Select the Standard tab and the yellow hexagon next to
the white one
The chart should now look like this:
The last thing to change is the colour of the
markers
❑ Double click on one of the markers
❑ This opens the Format Data Series dialog box
❑ Select the paint tab and the Marker option
❑ Select Solid fill and change the colour to red
❑ Select Solid line and change the colour to red
Making the Data Analysis
button visible
Select the File menu and the Options button
❑ Select the
Add-ins option
❑ This dialog
box should
appear
❑ Analysis
ToolPak
should already
be selected
❑ Select the
Go… button
❑ Finally, select the
Analysis ToolPak and
click on OK
❑ Now click on the Data
tab
❑ A new button should
appear on the Data tab
here
❑ Select the Data Analysis button
❑ Choose Regression from the list and OK
❑ This dialog box should appear:
❑ Enter C1:C21 for
the Y Range
❑ Enter B1:B21 for
the X Range
❑ Select Labels
❑ Select Output
Range and enter
the value N2
The output should now look like this:
❑ Change the font to Arial
❑ Adjust the column widths to make the text visible
❑ Reduce the number of decimal places of the numbers
with decimals to 5
❑ The output should now look like this:
The important values are:
❑ The Multiple R coefficient
❑ The Coefficients of the Intercept and the variable
❑ The P-value of the variable
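As a cross-check on the Excel output, the sketch below computes the correlation coefficient (Excel’s Multiple R is its absolute value in a simple regression), the intercept and the slope directly from the 20 households listed in Example 1, using the least squares formulas. It also computes the t statistic for the slope; turning that into a P-value needs the t distribution, which is left to Excel or SPSS here. Variable names are illustrative.

```python
import math

# Example 1 data (both in £000s), in household order 1 to 20
income = [25, 40, 60, 30, 29, 41, 37, 46, 105, 81,
          57, 72, 140, 23, 55, 58, 87, 72, 80, 48]
insurance = [90, 165, 220, 145, 114, 175, 145, 192, 395, 339,
             230, 262, 570, 100, 210, 243, 335, 299, 305, 205]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(insurance) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, insurance))
sxx = sum((x - mean_x) ** 2 for x in income)
syy = sum((y - mean_y) ** 2 for y in insurance)

slope = sxy / sxx                      # coefficient of the Income variable
intercept = mean_y - slope * mean_x    # coefficient of the Intercept
r = sxy / math.sqrt(sxx * syy)         # Pearson correlation coefficient
# t statistic for testing whether the slope (or r) differs from 0, with n - 2 df
t_statistic = r * math.sqrt((n - 2) / (1 - r ** 2))

print(f"r = {r:.5f}")
print(f"Insurance = {intercept:.5f} + {slope:.5f} x Income")
print(f"t = {t_statistic:.3f} on {n - 2} degrees of freedom")
```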
❑ Finally, we shall add a trend line to the scatter plot
❑ Select the scatter plot
❑ On the Design tab select Add Chart Element > Trend
Line > Linear
❑ Select the trend line
❑ On the Format tab use the Shape Outline button to
change its colour to black and its dash style to dashed
Compare your file with the file
CorrelationExample1Completed:
SPSS data analysis:
Summary
❑ Download SPSS onto your computer – see
other instructions
❑ Open the CorrelationExample1 data file from
today’s Google Drive folder
❑ Create a scatter plot
❑ Run a correlation analysis
❑ Interpret the output
❑ Run a linear regression analysis
❑ Interpret the output
❑ Add a trend line to the scatter plot
Start SPSS. For a Windows computer it is available either
from the folder IBM SPSS Statistics or by searching for
SPSS in Cortana
❑ Close the Excel
file (be sure
that you know
which folder it is
saved in)
❑ Close this
dialog box
❑ You should be
left with an
empty data
window
❑ Change the
folder to the
folder where
you saved the
Excel file
❑ Change the
file type to
Excel
❑ Select the file
❑ Click on Open
❑ Select OK with
the second
dialog box that
appears
Alternatively, you can just open the
SPSS data file from Google Drive
❑ The data window
should now look like
this:
❑ Click on the Variable
View tab
❑ Change the Measure
of Household to
Ordinal
❑ Now save the file
❑ Select Graphs > Legacy Dialogs > Scatter/Dot…
❑ Click on the Define button
❑ Put Insurance
on the Y Axis
and Income on
the X Axis
❑ Select OK
This scatter plot should appear in the Output window:
❑ Select Analyze > Correlate > Bivariate
❑ Put Income and Insurance in the variables list
and click on OK
This output should appear in the Output window:
As before, this shows a large correlation which is significantly
different from 0 at the 0.001 level (very strong evidence)
❑ Select Analyze > Regression > Linear…
❑ Add the variables as shown and click on OK
❑ This output
should
appear:
❑ Confirm that
these details
are the same
as in the
previous class
and also
consistent
with the Excel
analysis
❑ Scroll up the
Output window
and double
click on the
scatter plot
❑ This opens the
Chart Editor
window
❑ Select this
button then
close the
Properties
window and
this window
The scatter plot should now have an added trend line:
Confirm that
the constant
and slope
coefficients
are the
same as in
the output
table
Saving output
❑ Rather than saving an
output file or copying
and pasting in the
normal way we
recommend that you
right click on a table or
chart and select Copy
Special and Image
❑ The image can then be
pasted into a Word file
(without any formatting)
Part C. Chi-squared test
Example 1: managers’ views
on Brexit and future profits
A researcher collects data on the opinions of 200 company
managers about the likely impact of Brexit on their firm’s
profits and wishes to test if there is any association
between the location of the manager (in the North or South
of England) and their opinion.
Their views are shown in this table:
        Improve   Worsen
North   50        50
South   35        65
Research question
Is there an association between location and
views on the impact of Brexit on companies’
future profits?
Excel data analysis: Summary
❑ Download the ChiSquareExample1 Excel file from
today’s Google Drive folder
❑ Create a pivot table of Brexit impact against
location
❑ Create and format a percentage frequency chart
❑ Calculate the expected values using relative and
absolute formulas
❑ Assess the validity of the chi-squared test
❑ Calculate the chi-squared statistic and probability
value
❑ Interpret the probability value
The first goal is to create a percentage frequency
bar chart that looks like this:
❑ Download the file
ChiSquaredExample1
from today’s
Google Drive folder
and save it on your
computer (e.g.
Desktop) then open it
❑ Open the sheet “Excel
Data and
Descriptives”
❑ Select Insert > Pivot
Table
❑ This dialog
box should
appear
❑ Select
Existing
Worksheet
and enter the
Location E2
then click on
OK
The interface should now look like this:
Using the dialog box on the
right, click and drag Location
into the ROWS list,
BrexitImpact into the
COLUMNS list and either of
them into the VALUES list
The Pivot Table should now look like this:
Select the Pivot Table and change its font to
Arial
❑ Right click on one of the data values
❑ Select Show Values as > % of Row Total
The table should now look like this:
❑ Right click one of the values again
❑ Select Number Format…
❑ Reduce the number of decimal places to 0
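For reference, the same cross-tabulation and row percentages can be sketched outside Excel. The raw records below are rebuilt from the published counts (one record per manager), which is an assumption about how the data sheet is laid out. The variable names are illustrative.

```python
from collections import Counter

# Hypothetical raw records rebuilt from the counts in the example table
records = (
    [("North", "Improve")] * 50 + [("North", "Worsen")] * 50
    + [("South", "Improve")] * 35 + [("South", "Worsen")] * 65
)

counts = Counter(records)                                   # cell counts
row_totals = Counter(location for location, _ in records)   # totals per location

# Equivalent of "Show Values as > % of Row Total"
row_percentages = {
    cell: 100 * count / row_totals[cell[0]] for cell, count in counts.items()
}

for location in ("North", "South"):
    for view in ("Improve", "Worsen"):
        print(f"{location:6s} {view:8s} {row_percentages[(location, view)]:.0f}%")
```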
❑ Click on the Analyze tab
❑ Select the PivotChart button and choose the default chart
A chart should appear looking like this:
Right click on one of these grey buttons and select Hide
all Field Buttons from Chart
Goal: make the chart look like this:
Follow
the same
process
as before
❑ Change the font style, colour and size
❑ Add axis titles in bold
❑ Change the background colour and bar colours
❑ Now go to the sheet Excel Analysis
❑ In cell C12 add the formula “=$E4*C$6/$E$6” (you can
copy and paste it from these slides)
❑ Click on the cell C12 again
– it will then show you
which cells the formula
refers to (in colour)
❑ $ fixes a row or column in a cell reference so that it
does not change when the formula is copied
❑ The row is fixed in the red
cell (C$6)
❑ The column is fixed in the
blue cell ($E4)
❑ Both are fixed in the purple
cell ($E$6)
❑ Now copy and paste the contents of C12 into the other
cells in this table
❑ The table should now look like this:
❑ These are the expected frequencies for the four cells
❑ The chi-squared test is valid because all four of these
values are greater than 5
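The expected frequency calculation and the validity check can be sketched as follows. Each expected count is row total × column total ÷ grand total, which is what the sheet formula =$E4*C$6/$E$6 computes. The validity function is a hypothetical helper that applies the rule quoted elsewhere in these slides (fewer than 20% of expected counts below 5, and the minimum expected count above 1).

```python
observed = [[50, 50],   # North: Improve, Worsen
            [35, 65]]   # South: Improve, Worsen

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared_is_valid(expected_counts):
    """Apply the slides' rule: under 20% of cells below 5 and minimum above 1."""
    cells = [e for row in expected_counts for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return share_below_5 < 0.2 and min(cells) > 1


print(expected)
print("Chi-squared test valid:", chi_squared_is_valid(expected))
```

In a 2×2 table a single expected count below 5 is already 25% of the cells, which is why all four values need to reach 5 here.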
❑ In cell B17 enter the formula
“=CHISQ.TEST(C4:D5,C12:D13)”
❑ This should return the value
0.031905
❑ This is less than 0.05 but bigger
than 0.01, indicating there is
evidence of an association (the
“climate change” row of the
significance table) between
Location and Brexit Impact
❑ Compare your sheet with the
same sheet in the file
ChiSquaredExample1Completed
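To see where the CHISQ.TEST value comes from, the sketch below recomputes the chi-squared statistic and its probability value for the same 2×2 table. It relies on the fact that a 2×2 table has 1 degree of freedom, for which the upper-tail probability can be written with the complementary error function. It applies no continuity correction, and the variable names are illustrative.

```python
import math

observed = [[50, 50],   # North: Improve, Worsen
            [35, 65]]   # South: Improve, Worsen

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_squared_statistic = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_squared_statistic += (count - expected) ** 2 / expected

# For 1 degree of freedom: P(chi-squared >= x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_squared_statistic / 2))

print(f"Chi-squared = {chi_squared_statistic:.4f}")
print(f"P-value = {p_value:.6f}")
```

The printed probability value should agree with the 0.031905 returned by the Excel formula, to the precision shown there.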
SPSS data analysis:
summary
❑ Import the data from Excel or download and
open the ChiSquareExample1 SPSS data file
from this week’s Moodle page
❑ Carry out a cross-tabulation, creating a table of
the counts and expected counts
❑ Assess the validity of the chi-squared test
❑ Carry out a chi-squared test
❑ Interpret the output
❑ Save the file
ChiSquareExample1
on your computer (e.g.
on the Desktop). Make
sure this file is closed.
❑ Open a new data file in
SPSS and search for
the folder containing
this file
❑ Change the file type to
Excel – this file should
appear in the list
❑ Open this file
❑ Select the sheet SPSS
Raw Data in the dialog
box that appears
❑ On the Variable View change the measure of ID to Ordinal
❑ Select the Values option for Location. In the dialog box
enter 1 and North then Add, then 2 and South then Add
❑ Repeat with Improve and Worsen for BrexitImpact
❑ Select Analyze > Descriptive Statistics > Crosstabs…
❑ Put Location in the rows and BrexitImpact in the columns
❑ Select Cells… and add Expected
❑ The generated table enables us to assess the
validity of chi-squared
❑ As none of the expected counts is less than 5
we can run the chi-squared test
❑ Select Analyze > Descriptive Statistics > Crosstabs…
again
❑ The settings from the previous analysis have been
retained
❑ Select Statistics… and Chi-square
❑ The significance value of the Pearson Chi-Square is
0.032 (same as in the Excel analysis)
❑ Significant at 0.05
❑ Provides evidence of an association between Location
and BrexitImpact
Exercise
Repeat the activities (both in Excel and SPSS)
with CorrelationExample2 and
ChiSquaredExample2 (both available from today’s
Google Drive folder).
Note:
❑ The chi-squared test for a 6 cell grid is valid
provided that 5 of the expected frequencies are
greater than 5 and the sixth is greater than 1
❑ The probability value is very small – you will
need to change the number format to Number
and increase the number of decimal places
Recap
We have covered:
❑ Part A. Brief introduction to statistics
❑ Part B. Correlation and regression:
➢ Excel analysis
➢ SPSS analysis
❑ Part C. Chi-squared test:
➢ Excel analysis
➢ SPSS analysis
❑ Exercises
Plenary discussion
❑ How did you get on?
❑ Did you encounter any problems?
❑ What were the main things you learnt?
References
Field, A. (2018) Discovering Statistics.
https://www.discoveringstatistics.com/.
Field, A. (2018) Discovering Statistics using IBM SPSS
Statistics. 5th edn. London: SAGE.
Samuels, P. C. (2022) Quantitative Software Training
Summary.
https://www.researchgate.net/publication/360131234_Quantitative_Software_Lab_Training_summary.
statstutor (n.d.) Statistics Support for Students -
www.statstutor.ac.uk. http://www.statstutor.ac.uk/.
Stirling, W. D. (2019) Textbooks for learning statistics:
CAST e-books. https://cast.idems.international/.