Which stats package should I use?
Data and Decision Science Network
Part of the UOW Data and Decision Science Initiative
Marijka Batterham, Bradley Wakefield, Alberto Nettel-Aguirre
Outline
• Introductions
• Data and Decision Science Network – why are we giving this talk?
• The first step – Why do you need the stats package? What is your research question?
• How many stats packages are there?
• Let’s have a look at
• SPSS
• R (Rstudio)
• STATA
• Jamovi
• SAS
• Python
• Excel
• Comparison – t test
• How will you choose?
• Take home message
• Where to from here?
Introductions
Professor Marijka Batterham
• Co-Ordinator Data & Decision Science Initiative
• Director NIASRA
• Director Stats Consulting Centre
• Passionate about data literacy
• Use RStudio/SPSS most often
• Favourite analysis: logistic regression
• Mostly use: mixed models
• Likes learning machine learning/data mining & exploring new packages
Brad Wakefield
• Statistical Consultant in the Stats Consulting Centre.
• Rstudio is my go-to but commonly use other packages in teaching and consulting.
• Interests in data privacy, probability theory, statistical inference, and data analytics.
• Passion for ethical applications of data science methods in research and industry.
• Enjoys learning and collaborating with other disciplines and solving real-world problems.
• Always up for a chat.
Professor Alberto Nettel-Aguirre
• Director CHSA
• Crusader for correct use and understanding of biostatistics
• Enjoys teaching stats-without-pain to other disciplines
• R/Rstudio preference, STATA/SPSS due to collaborations.
• Python exposure due to consultancy work
• Always wanting to learn and try new techniques
UOW Data & Decision Science Initiative
• The Data and Decision Science Initiative is part of the UOW
Strategic Plan (2.5 Transformative technologies)
• Developed from a 2019 review and recommendations of “Big Data”
and Health Informatics at UOW
• Updated to reflect UOW in 2021, presented at VCAG in April,
commenced July 2021
Data Science is the extraction of actionable knowledge directly from data through a process of
discovery, or hypothesis formation and hypothesis testing
[Diagram: Data Science in relation to Domain knowledge, Statistics and Computer science]
Data & Decision Science Initiative
four key areas of focus
Research: virtual network and working groups of Data and Decision Science researchers
• Focal point for coordinating the development of Data Science at UOW
• Composed of researchers actively using or interested in Data Science methods
• Themed meetings emphasising translation: Data and Decision Science Network (DDSN)
• Strategic collaborations through the DDSI give a competitive advantage in translation
Education: Training in data science and reproducibility of research.
• Internal and external training and education in data science
• Upskilling research students & staff (particularly ECRs) in data & decision science methods
• Workshops (GRS, Statistical Consulting Centre)
T shaped graduates: Reviewing service subjects to refocus on data science.
• Review of service subjects in statistics and quantitative methods offered through SMAS to give data
science focus
• Graduates literate in data science and reproducible research
External/Industry engagement: Capitalising on existing links
• Provide enhanced opportunities for external engagement
Choosing a Stats Package
• Why do you need a stats package?
• Data manipulation
• Descriptive statistics
• Visualisation
• Modeling/inference
• Teaching
• What is your research question?
• What does your data look like?
What is your research question?
• Are you describing a sample/population?
• Are you looking at differences or relationships between groups?
• Is visualization (pretty figures and graphs) important?
• Are you analysing survey data?
• Are you investigating a change over time?
• Is there missing data?
• Is there clustered/multilevel data?
• Will your model be complicated (non-linearities, assumption violations,
interactions)?
What does your data look like?
• How big is your dataset?
• Can you do it on your laptop, or do you need special computing facilities
or high performance computing?
• Do you have to link datasets?
• Is your data complicated (linked, relational, administrative)?
• Are you going to be working on this project for a long time?
• Are you working with collaborators in other schools, disciplines, UOW
(different packages)?
• Are you teaching with stats software?
How many Stats packages are there?
Stats/Data Science packages: open source
• ADaMSoft – a generalized statistical software with data mining and data management
• ADMB – non-linear statistical modeling (programming language)
• Chronux – for neurobiological time series data
• DAP – free replacement for SAS
• Environment for DeveLoping KDD-Applications Supported by Index-Structures (ELKI) – data mining in Java
• Epi Info – statistical software for epidemiology developed by the CDC
• Fityk – nonlinear regression software
• GNU Octave – programming language very similar to MATLAB with statistical features
• gretl – gnu regression, econometrics and time-series library
• intrinsic Noise Analyzer (iNA) – analyzing intrinsic fluctuations in biochemical systems
• JASP – A free software alternative to IBM SPSS Statistics with additional option for Bayesian methods
• Just another Gibbs sampler (JAGS) – a program for analyzing Bayesian hierarchical models
• JMulTi – For econometric analysis, univariate and multivariate time series analysis
• KNIME – analytics platform built with Java and Eclipse using modular data pipeline workflows
• LIBSVM – C++ support vector machine libraries
• mlpack – open-source library for machine learning
• Mondrian – data analysis & interactive statistical graphics with a link to R
• Neurophysiological Biomarker Toolbox – data-mining of neurophysiological biomarkers
• OpenBUGS
• OpenEpi – epidemiology and statistics
• OpenNN – neural networks, deep learning
• OpenMx – A package for structural equation modeling running in R
• Orange – a data mining, machine learning, and bioinformatics software
• Pandas – High-performance computing (HPC) data analysis tools for Python in Python and Cython (statsmodels, scikit-learn)
• Perl Data Language – Scientific computing with Perl
• Ploticus – software for generating a variety of graphs from raw data
• PSPP – A free software alternative to IBM SPSS Statistics
• R – free implementation of the S programming language
• ROOT – data storage, processing and analysis, developed by CERN and used to find the Higgs boson
• Salstat – menu-driven statistics software
• SciPy – Python library for scientific computing
• scikit-learn – extends SciPy with a host of machine learning models (classification, clustering, regression, etc.)
• statsmodels – extends SciPy with statistical models and tests
• Shogun – large-scale machine learning toolbox that provides several SVM (Support Vector Machine) implementations
• Simfit – simulation, curve fitting, statistics, and plotting
• SOFA Statistics – desktop GUI program focused on ease of use, learn as you go, and beautiful output
• Stan (software) – open-source package for obtaining Bayesian inference
• Statistical Lab – R-based and focusing on educational purposes
• TOPCAT – graphical analysis and manipulation package for astronomers
• Torch – a deep learning software library written in Lua
• Weka – machine learning software
Source: Wikipedia
Some proprietary stats packages
• Alteryx – statistical models; R and Python integration
• Analytica – visual analytics and statistics package
• Angoss – data mining algorithms
• ASReml – for restricted maximum likelihood analyses
• BMDP – general statistics package
• DataGraph – visual analysis and regression
• DB Lytix – 800+ in-database models
• EViews – for econometric analysis
• FAME (database) – managing time-series databases
• GAUSS – programming language for statistics
• Genedata – experimental data in life science R&D
• GenStat – general statistics package
• GLIM – generalized linear models
• GraphPad InStat
• GraphPad Prism – biostatistics nonlinear regression
• IMSL Numerical Libraries – software library
• JMP – visual analysis and statistics package
• LIMDEP – statistics and econometrics
• LISREL – structural equation modeling
• Maple – programming language with statistical features
• Mathematica – some statistical features
• MATLAB – programming language with statistics
• MaxStat Pro – general statistical software
• MedCalc – for biomedical sciences
• Microfit – econometrics package, time series
• Minitab – general statistics package
• MLwiN – multilevel models (free to UK academics)
• Nacsport Video Analysis Software – analysing sports
• NAG Numerical Library – math and statistics library
• Neural Designer – commercial deep learning package
• NCSS – general statistics package
• NLOGIT – statistics and econometrics package
• nQuery Sample Size Software – Sample Size/Power
• O-Matrix – programming language
• OriginPro – statistics and graphing
• PASS Sample Size Software (PASS) – power/sample size
• Plotly – plotting library for R, Python, MATLAB, Julia, Perl
• Primer-E Primer – environmental and ecological specific
• PV-WAVE – data analysis/visualization
• Qlucore Omics Explorer – data analysis software
• RapidMiner – machine learning toolbox
• Regression Analysis of Time Series (RATS) – econometrics
• SAS (software) – comprehensive statistical package
• SHAZAM – econometrics and statistics package
• Simul – econometric tool multidimensional modeling
• SigmaStat – package for group analysis
• SmartPLS – partial least squares path modeling (PLS)
• SOCR – teaching statistics and probability theory
• Speakeasy – statistical and econometric analysis features
• SPSS Modeler – data mining and text analytics workbench
• SPSS Statistics – comprehensive statistics package
• Stata – comprehensive statistics package
• StatCrunch – comprehensive statistics package
• Statgraphics – general statistics package
• Statistica – comprehensive statistics package
• StatsDirect – statistics for public health, health science
• StatXact – exact nonparametric and parametric statistics
• Systat – general statistics package
• SuperCROSS – comprehensive statistics package
• S-PLUS – general statistics package
• Unistat – general statistics package
• The Unscrambler – multivariate analysis
• WarpPLS – structural equation modeling
• Wolfram Language – some statistical capabilities
• World Programming System (WPS) – supports use of Python, R and SAS within single user program
• XploRe
Source: Wikipedia
Commonly used packages at UOW
IBM® SPSS® Statistics
• IBM commercial product
• Statistical Package for the Social Sciences, first released 1968
• Widely used in teaching
• Many online resources
• Menu driven GUI
• Good for standard and most common advanced methods
• Nice missing data/multiple imputation options
• Bayesian analyses
• New meta-analysis capacity in V28
• New workbook facility incorporating syntax in version 28.
• https://www.ibm.com/products/spss-statistics
IBM® SPSS® Statistics - cons
• Menu use is repetitive and time consuming, switch to syntax if using
frequently
• Outputs everything for some analyses, can be overwhelming
• Graphing capacity is limited but editable.
• Major interface change is currently in beta testing
R (R Studio) Pros
• Free and open source
• Released in 1995, developed by Ross Ihaka and Robert Gentleman at the
University of Auckland, based on the S language (commercially available as S-PLUS)
• Relies on active user community to develop and maintain discipline specific
packages
• R is an open source programming language designed for statistical analysis
• R is rarely used stand-alone; it is almost always used through an Integrated
Development Environment (IDE). RStudio is the most widely used; there are others, e.g.
Emacs.
• RStudio has a commercial arm which supports business and funds the free
development.
• Extensive standard and advanced statistical methods
• Constantly increasing number of statistical packages, discipline specific
packages
• You can develop your own statistical package
• Encourages reproducible research
R (R Studio) Cons
• Steep learning curve
• Dependencies, relies on user community to maintain packages
• Some packages dependent on others may cease working
• Workarounds require advanced knowledge (note that you can at least save the
package versions used, as part of reproducible research)
• Changing constantly – stay up to date
• Base R versus the Tidyverse (two ways to use R)
R User interface packages
• There are many of these
• Jamovi https://www.jamovi.org/
• JASP https://jasp-stats.org/
• BlueSky https://blueskystatistics.com
• Rcommander https://socialsciences.mcmaster.ca/jfox/Misc/Rcmdr/
• R-Instat www.r-instat.org
• Deducer www.deducer.org
• RKWard https://rkward.kde.org/
• Rattle https://rattle.togaware.com/
• RAF https://r.analyticflow.com/en/
STATA® - Pros
www.stata.com
• First released in 1985 by StataCorp (Bill Gould)
• Code driven and GUI
• Reasonably priced for academics/students
• Good for standard and many advanced methods including robust analysis
• Nice for survey analysis, many Australian surveys have STATA code for weighting
• Nice for meta-analysis (more advanced options in R)
• Some user written ado files eg stepped wedge designs
• Multiple imputation for missing data
• SEM
• Used extensively in epidemiology, public health, social science
STATA® - Cons
• Learning the code
• Menu driven options not as user friendly as other packages
• Only used in some discipline areas
Jamovi - Pros
• Open source project, 2 of 3 founders are Australian
• Looks like SPSS
• Has good support/longevity
• Nice modules for commonly performed analyses
• Immediately visible output
• Great for introductory teaching
• Has free online textbook, many online resources
• Used in teaching at UOW
Jamovi - Cons
• Output to pdf, html
• Output to word through editing pdfs
• Unable to edit graphs
• Dependent on existing modules (this is constantly increasing)
• Currently no machine learning, AI modules
SAS® - Pros
www.sas.com
• Commercial product developed 1966-76, SAS Institute Inc.
• Statistical Analysis System
• AI business market focus
• Substantial investment in AI capacity
• Planning for IPO listing in 2024
• Gold standard for pharmaceutical trials and governments (? R use increasing)
• Runs on code, using DATA and PROC statements
• Available free for academic use if used in the cloud
• SAS® OnDemand for academics
• SAS Studio has pull down menu options
SAS® - Cons
• Menu driven SAS Studio not as intuitive as other packages
• Learning the code, unique to SAS
• Still currently used in health, pharma and business. Undergoing
generational change as new analysts come through
Python - Pros
• Open source general-purpose programming language, cross-platform
• Development began in 1989 (first public release 1991) by Guido van Rossum at Centrum
Wiskunde & Informatica in the Netherlands; named after Monty Python.
• Nice interface with low-level languages and GPU acceleration.
• Used extensively in web development.
• Gained popularity for machine learning/ deep learning/data science
• Supported by user community
• Used in scalable production environments
• Many libraries are available
• Can be used in an IDE or in notebooks such as Jupyter (a web-based, interactive computing
notebook environment).
• Can be run locally or on a server (e.g. Google Colaboratory)
• No braces, no semi-colons, indentation is used to structure code.
• Not so steep learning curve
• Currently Number 1 programming language https://www.tiobe.com/tiobe-index/
Python - Cons
• Not as many stats packages as R (though many general programming
packages)
• No braces, no semi-colons, indentation is used to structure code.
• Reading someone else's code can be tough for beginners
• Visualizations possible but not as good as others (R)
• High memory usage
Do Not Use Excel for Data Analysis
• No capacity to store code of changes made
• Analysis is not reproducible
• Formulas are hidden in cells and can be accidentally overwritten
• Easy to accidentally change numbers and no way to trace this
• Limited data size
• Encourages chartjunk
• “Friends don’t let Friends use Excel for statistics”
(Cryer, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.617.4297&rep=rep1&type=pdf )
Demonstration of t test in different packages
Research question
Sample dataset compares Body Mass Index, BMI (kg/m2), between people with
and without diabetes. Simulated from the Pima Indian Dataset. Dabelea et al.
Journal of Maternal Fetal Medicine 2000;9:83-88.
BMI is continuous and reasonable to assume normally distributed.
Diabetes is categorical (binary): 0 = no diabetes, 1 = diabetes.
Research question: Is there a significant difference in mean BMI between
those with and those without diabetes?
T test procedure
1. Always, Always, Always plot your data – side by side boxplots
2. Check assumptions
• equality of variance (Levene’s test)
• Normality of groups (Shapiro-Wilk)
3. Perform t test
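Step 2 can be made concrete with a small sketch. The median-centred Levene's test amounts to a one-way ANOVA on the absolute deviations of each observation from its own group median. The Python sketch below computes only the test statistic, using the standard library. The function name `levene_statistic` and the toy BMI-like values are illustrative assumptions, not the workshop dataset, and a statistics package would normally report the p-value as well.

```python
from statistics import mean, median


def levene_statistic(*groups):
    """Median-centred Levene (Brown-Forsythe) statistic for k groups.

    Equivalent to the F statistic of a one-way ANOVA on the absolute
    deviations of each value from its own group median.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # Absolute deviation of every observation from its group median
    medians = [median(g) for g in groups]
    deviations = [[abs(x - m) for x in g] for g, m in zip(groups, medians)]

    group_means = [mean(z) for z in deviations]
    grand_mean = mean([z_ij for z in deviations for z_ij in z])

    # Between-group and within-group sums of squares of the deviations
    between = sum(len(z) * (m - grand_mean) ** 2
                  for z, m in zip(deviations, group_means))
    within = sum((z_ij - m) ** 2
                 for z, m in zip(deviations, group_means) for z_ij in z)

    # F = [between / (k - 1)] / [within / (N - k)]
    return (between / (k - 1)) / (within / (n_total - k))


# Toy values for illustration only (hypothetical, not the workshop data)
no_diabetes = [24.1, 27.3, 31.2, 34.9, 38.0, 29.5]
diabetes = [30.2, 32.4, 35.9, 39.0, 41.3, 36.1]

w = levene_statistic(no_diabetes, diabetes)
df_within = len(no_diabetes) + len(diabetes) - 2
print(f"Levene statistic = {w:.3f} on (1, {df_within}) df")
```

A statistic near zero means the groups have similar spread. The p-value comes from comparing the statistic with an F distribution on (k - 1, N - k) degrees of freedom.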
One way it might be written for a paper
• Methods: Data analysis
• An independent two-sided t test was used to determine if the difference in
BMI between those with and those without diabetes was statistically
significant. Assumptions were tested visually prior to analysis, homogeneity
of variance was assessed using Levene’s test and normality using the
Shapiro-Wilk test. An alpha level of 0.05 was used to determine statistical
significance. Data was analysed using (Stats package, Version, company)
• Results:
• Report difference (CI or SD), t statistics and df.
• The mean difference in BMI between those with and without diabetes was
4.65 (SD 5.47) kg/m2, (CI 3.39, 5.90), t=7.27 (df=330), P<0.001.
Group          Number   Mean (SD) BMI, kg/m2
No diabetes    223      31.21 (5.68)
Diabetes       109      35.86 (5.01)
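As a check on how the reported figures hang together, the pooled standard deviation, degrees of freedom and t statistic can be recomputed from the group summaries on this slide (n, mean, SD). The sketch below is illustrative Python using only the standard library, and the function name is an assumption of this example. Because it starts from values already rounded to two decimals, its results agree with the reported ones only to rounding.

```python
from math import sqrt


def pooled_t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance (Student) two-sample t test from summary statistics."""
    df = n1 + n2 - 2
    # Pooled variance: average of the group variances, weighted by their df
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    pooled_sd = sqrt(pooled_var)
    # Standard error of the difference between the two means
    std_error = pooled_sd * sqrt(1 / n1 + 1 / n2)
    t_stat = (mean1 - mean2) / std_error
    return pooled_sd, df, t_stat


# Group summaries as reported on this slide (diabetes minus no diabetes)
pooled_sd, df, t_stat = pooled_t_from_summary(109, 35.86, 5.01,
                                              223, 31.21, 5.68)
print(f"pooled SD = {pooled_sd:.2f}, df = {df}, t = {t_stat:.2f}")
```

The "SD" quoted next to the mean difference corresponds to this pooled standard deviation, and the degrees of freedom are simply 223 + 109 - 2.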
• Performing the t test in SPSS, Rstudio, STATA, Jamovi, SAS and Python
Package   Good for
Jamovi    Teaching, infrequent use of stats (easy to pick up again if you have a break), basic analysis and some advanced methods, easy to learn, good default outputs
Python    Machine learning, AI, in demand skill, regular users, good for research collaboration and integration to web platforms
Rstudio   Data manipulation, visualisation, advanced analysis, in demand skill, reproducible research, advanced missing data options, regular users
SAS       Good overall package for most standard and many advanced methods, regular users, big data, good for pharma and govt
SPSS      Good overall package for most standard and many advanced methods, easy to learn, infrequent use
STATA     Good overall package, has many useful advanced procedures, used regularly in some professions, particularly good for survey analysis, meta analysis, SEM
Take home message
• If you are analysing data, it is likely you will need to use more than one
package during your career
• All packages are changing significantly over time as more methods
become available and computing power increases
• If you publish - Learn a package that encourages reproducible research
• To have a competitive advantage now for your career use R (or Python)
• If you don’t use stats much stick to what you know, and ask a
professional
• Regardless of the package you will need to understand the stats to
perform and interpret the output.
• If doing advanced, specialized or machine learning methods, the “best”
package will depend on ease of use for that analysis. Some packages lack
advanced methods, e.g. SPSS does not have a menu option for Generalised
Additive Models (GAMs), although these can be accessed through the R plugin
in SPSS, and jamovi offers only the modules that have been developed.
SPSS
RStudio
Write your code in the Script window
Select Run or CTRL + ENTER to run your code.
INSTALLING PACKAGES
• When using R and RStudio you may need to install packages in order to
run the analyses.
• Whilst many functions are included in base R, installing packages is easy!
• To install packages all you need to do is call
install.packages("package-name")
R will look online, download it, and install it for you.
install.packages('tidyverse')
install.packages('car')
• You will need to load the library to use it!
library(tidyverse)
READING IN YOUR DATA
Loading in your data is pretty straightforward!
data <- read_csv("pathto/ttestdiabetes.csv")
Or you can just find it in your files window in RStudio!
THE DATA IS NOW LOADED
data <- read_csv("ttestdiabetes.csv")
head(data)
## # A tibble: 6 x 8
## npreg gluc bp skin ped age Diabetes BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 5.72 80 11 0.491 22 0 17.9
## 2 1 5.11 62 25 0.482 25 0 17.9
## 3 1 5.55 74 12 0.149 28 0 18.3
## 4 1 5.27 66 13 0.334 25 0 18.5
## 5 6 7.16 90 7 0.582 60 0 20.2
## 6 0 5.83 68 22 0.236 22 0 20.5
data$Diabetes <- factor(data$Diabetes)
levels(data$Diabetes) <- c("No Diabetes","Diabetes")
OBTAINING A BOXPLOT
boxplot(BMI ~ Diabetes,data = data,col = c('blue'))
OBTAINING A BOXPLOT (GGPLOT2)
ggplot(data)+ geom_boxplot(aes(x=Diabetes,y=BMI))
WHAT ABOUT YOUR DESCRIPTIVES?
You can go line by line…
data0 <- filter(data,Diabetes == "No Diabetes")
data1 <- filter(data,Diabetes == "Diabetes")
mean(data0$BMI)
## [1] 31.21093
sd(data0$BMI)
## [1] 5.675752
summary(data0$BMI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.89 27.33 31.21 31.21 34.88 49.05
mean(data1$BMI)
## [1] 35.85649
sd(data1$BMI)
## [1] 5.009008
summary(data1$BMI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.28 32.41 35.94 35.86 38.96 47.72
WHAT ABOUT YOUR DESCRIPTIVES?
But there are always multiple ways in R…
data <- group_by(.data=data,Diabetes)
summarise(.data=data,
Avg = mean(BMI),Std_Dev = sd(BMI),Min = min(BMI),
Max = max(BMI),Median = median(BMI),Q1 = quantile(BMI,0.25),
Q3 = quantile(BMI,0.75))
## # A tibble: 2 x 8
## Diabetes Avg Std_Dev Min Max Median Q1 Q3
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 No Diabetes 31.2 5.68 17.9 49.0 31.2 27.3 34.9
## 2 Diabetes 35.9 5.01 21.3 47.7 35.9 32.4 39.0
data %>% group_by(Diabetes) %>%
summarise(N = length(BMI), Avg = mean(BMI), Std_Dev = sd(BMI),
Min = min(BMI), Max = max(BMI), Median = median(BMI),
Q1 = quantile(BMI,0.25), Q3 = quantile(BMI,0.75))
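For anyone following along in Python rather than R, the same grouped summary can be sketched with the standard library alone. The toy values and the `describe` helper are illustrative assumptions, not the workshop data. `statistics.quantiles(..., method="inclusive")` is used because it interpolates quartiles the same way as R's default `quantile()`.

```python
from statistics import mean, median, quantiles, stdev


def describe(values):
    """Summary statistics mirroring the columns of the R summarise() call."""
    # Quartiles with inclusive interpolation (matches R's default quantile)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "N": len(values),
        "Avg": mean(values),
        "Std_Dev": stdev(values),  # sample standard deviation, like R's sd()
        "Min": min(values),
        "Max": max(values),
        "Median": median(values),
        "Q1": q1,
        "Q3": q3,
    }


# Toy BMI values for illustration only (hypothetical, not the workshop data)
bmi_by_group = {
    "No Diabetes": [24.8, 27.3, 29.6, 31.2, 34.9],
    "Diabetes": [30.7, 32.4, 35.9, 36.5, 39.0],
}

for group, values in bmi_by_group.items():
    summary = describe(values)
    print(group, {name: round(stat, 2) for name, stat in summary.items()})
```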
CHECKING ASSUMPTIONS
You only get out what you ask… Here’s the Shapiro Wilk Test
shapiro.test(data0$BMI)
##
## Shapiro-Wilk normality test
##
## data: data0$BMI
## W = 0.99531, p-value = 0.7283
shapiro.test(data1$BMI)
##
## Shapiro-Wilk normality test
##
## data: data1$BMI
## W = 0.9926, p-value = 0.8256
CHECKING ASSUMPTIONS
But you can always manipulate in a better way!
summarise(.data=group_by(.data=data,Diabetes),
Shapiro_W = shapiro.test(BMI)$statistic,
Shapiro_p = shapiro.test(BMI)$p.value)
## # A tibble: 2 x 3
## Diabetes Shapiro_W Shapiro_p
## <fct> <dbl> <dbl>
## 1 No Diabetes 0.995 0.728
## 2 Diabetes 0.993 0.826
CHECKING ASSUMPTIONS
You’ll need to load a library for the homogeneity of variances check!
library(car)
leveneTest(BMI~Diabetes,data=data)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 2.433 0.1198
## 330
PERFORMING THE T-TEST
Make sure to specify the correct parameters.
t.test(BMI~Diabetes,data=data,var.equal = TRUE)
##
## Two Sample t-test
##
## data: BMI by Diabetes
## t = -7.2715, df = 330, p-value = 2.599e-12
## alternative hypothesis: true difference in means between group No Diabetes and group Diabetes is not equal to 0
## 95 percent confidence interval:
## -5.902336 -3.388790
## sample estimates:
## mean in group No Diabetes mean in group Diabetes
## 31.21093 35.85649
PERFORMING THE T-TEST
If you are unsure what parameters are needed and what their
default values are, just ask for help
help("t.test")
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
STATA
STATA syntax
Jamovi
The window shown below should be open.
[Screenshot: jamovi window showing the Tab Bar, Spreadsheet View and Results View]
Go to variables view and delete the three default variables.
Go to File (≡); Import; Browse and then select your Dataset.
Select the Data Tab and you should see your Data.
QUALITATIVE VARIABLES IN JAMOVI
Categories of qualitative variables are referred to as Levels in jamovi.
Levels can be given text labels in the Levels list.
Exploratory Data Analysis is always essential.
You can perform basic EDA with Analyses; Exploration; Descriptives
Results are shown in an editable and dynamic results window.
[Screenshot labels: Variable, Factor]
A range of Descriptive statistics can be produced.
Plots are produced immediately and with some customisations.
To perform the T-Test select
Analyses; T-Tests; Independent Samples T-Test
Output is generated as options are selected.
Assumption Checks can be found and selected easily!
Unpooled (Welch’s), non-parametric (Mann-Whitney U),
confidence intervals, and alternative hypothesis options are all
found in the same window!
To see the code…
Select Options
Select Syntax mode
This is R code!
jamovi outputs can be generated directly in R with the jmv package and the code.
SAS Studio
Click on tasks to open the statistics menu
Select Data Exploration, click on the spreadsheet icon to identify the dataset, click + to add the analysis variable and the category, click on plots, check the BOX PLOT, click run
Levene’s test through one-way ANOVA
Normality test done as part of the t test
Python
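The Python demonstration is not reproduced in this text, so the following is only a sketch of how the same independent-samples t test could be set up in Python. To stay runnable without extra installs it uses the standard library and computes the t statistic and degrees of freedom by hand, in both the equal-variance (Student) and unequal-variance (Welch) forms. In practice a library routine such as `scipy.stats.ttest_ind` would usually be used, and it also returns the p-value. The CSV text, its values and the function names are illustrative assumptions; only the column names `Diabetes` and `BMI` come from the dataset shown earlier.

```python
import csv
import io
from math import sqrt
from statistics import mean, variance

# Hypothetical rows standing in for ttestdiabetes.csv (values are made up)
CSV_TEXT = """Diabetes,BMI
0,27.3
0,31.2
0,24.8
0,34.9
0,29.6
1,32.4
1,35.9
1,39.0
1,30.7
1,36.5
"""


def split_by_group(csv_file, group_col="Diabetes", value_col="BMI"):
    """Return {group label: [values]} from an open CSV file object."""
    groups = {}
    for row in csv.DictReader(csv_file):
        groups.setdefault(row[group_col], []).append(float(row[value_col]))
    return groups


def student_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return t, df


def welch_t(a, b):
    """Unequal-variance (Welch) t statistic and Welch-Satterthwaite df."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


groups = split_by_group(io.StringIO(CSV_TEXT))
no_diabetes, diabetes = groups["0"], groups["1"]

results = {"Student": student_t(no_diabetes, diabetes),
           "Welch": welch_t(no_diabetes, diabetes)}
for name, (t_value, dof) in results.items():
    print(f"{name}: t = {t_value:.3f}, df = {dof:.1f}")
```

To run this on a real file, pass `open("ttestdiabetes.csv", newline="")` to `split_by_group` in place of the in-memory text. As in the other packages, plot the data and check the assumptions before running the test.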