Introduction to Biostatistics
Shamik Sen
Dept. of Biosciences & Bioengineering
IIT Bombay
Standard Deviation: Practical Significance
Chebyshev’s Theorem:
Given a number k (greater than 1) and a set of N
measurements, at least 1 − 1/k² of the measurements will lie
within k standard deviations of their mean.
Example: n = 26, Mean = 75, Variance = 100. Comment on the
distribution.
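One way to comment on this distribution is to evaluate the Chebyshev bound for a few values of k. The sketch below assumes k = 2 and k = 3 (the slide does not say which values were used); the variable names are illustrative.

```r
# Chebyshev bound for the example: n = 26, mean = 75, variance = 100
n      <- 26
mean_x <- 75
sd_x   <- sqrt(100)               # standard deviation = 10

k     <- c(2, 3)                  # assumed choices of k
bound <- 1 - 1 / k^2              # minimum fraction within k s.d. of the mean
lower <- mean_x - k * sd_x        # lower end of each interval
upper <- mean_x + k * sd_x        # upper end of each interval
min_count <- ceiling(n * bound)   # smallest whole number of measurements

data.frame(k, lower, upper, bound, min_count)
```

With these inputs, at least 75% of the measurements (20 of 26) lie between 55 and 95, and at least about 89% (24 of 26) lie between 45 and 105, whatever the shape of the distribution.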
Standard Deviation of Normal Distributions
+/- 1 s.d.: 68%
+/- 2 s.d.: 95%
+/- 3 s.d.: 99.7%
[Figure: distribution curve; x-axis: Height, y-axis: Relative Frequency]
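The three percentages can be checked against the standard normal distribution. This is a small sketch using base R's `pnorm()`; the variable names are illustrative.

```r
# Fraction of a normal distribution within k standard deviations of the mean
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)   # area between -k and +k

round(100 * coverage, 1)           # approximately 68.3, 95.4, 99.7
```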
Relative Standing: z-score
Z-score = (x-Mean)/Standard Deviation
Example: Mean = 25, Standard Deviation = 4; x = 30
An observation with |z-score| > 3 is treated as an outlier.
Example: 1, 1, 0, 15, 2, 3, 4, 0, 1, 3
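Both z-score examples can be worked through in R. The sketch below assumes the sample standard deviation (n − 1 denominator), which is what `sd()` computes; the slides do not state which denominator was intended.

```r
# Example 1: mean = 25, standard deviation = 4, x = 30
z_single <- (30 - 25) / 4          # 1.25

# Example 2: z-scores for the data set on the slide
x <- c(1, 1, 0, 15, 2, 3, 4, 0, 1, 3)
z <- (x - mean(x)) / sd(x)         # sd() uses the n - 1 denominator
round(z, 2)

# Values flagged by the |z| > 3 rule
flagged <- x[abs(z) > 3]
flagged
```

Under this assumption the value 15 has a z-score of about 2.71, so the |z| > 3 rule does not flag it even though it sits far from the rest of the data.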
Relative Standing: Percentiles
The pth percentile is the value that is greater than p% of the
measurements.
Q1: first quartile (at position 0.25*(n+1) in the sorted data)
Q3: third quartile (at position 0.75*(n+1) in the sorted data)
Inter-quartile Range = IQR = Q3 – Q1
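The position rule can be sketched as a small function. The helper `quartile_at()` is hypothetical, written for illustration; it assumes linear interpolation when the position is not a whole number, and the data are reused from the z-score example. Base R's `quantile()` with `type = 6` follows the same (n + 1) position rule.

```r
# Quartiles from the position rule p * (n + 1), with linear interpolation
quartile_at <- function(x, p) {
  s   <- sort(x)
  n   <- length(s)
  pos <- min(max(p * (n + 1), 1), n)   # keep the position inside 1..n
  lo  <- floor(pos)
  hi  <- ceiling(pos)
  s[lo] + (pos - lo) * (s[hi] - s[lo])
}

x  <- c(1, 1, 0, 15, 2, 3, 4, 0, 1, 3)
q1 <- quartile_at(x, 0.25)             # position 2.75 in the sorted data
q3 <- quartile_at(x, 0.75)             # position 8.25 in the sorted data
iqr <- q3 - q1

c(Q1 = q1, Q3 = q3, IQR = iqr)

# Same rule via base R
quantile(x, c(0.25, 0.75), type = 6)
```

Note that `quantile()` defaults to `type = 7`, which uses a different position rule and can give slightly different quartiles.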
Box-plot
Five-number summary: Min, Q1, Median, Q3, Max
[Figure: box plot; axis label: Area (sq. microns)]
Detecting Outliers
Lower Fence = Q1 − 1.5 × IQR
Upper Fence = Q3 + 1.5 × IQR
Plotting Box-Plot
340, 300, 520, 340, 320, 290, 260, 330
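For this data set, the five-number summary, the fences, and the box plot could be obtained roughly as follows. The sketch assumes the (n + 1) position rule for quartiles (`type = 6` in `quantile()`); the variable names are illustrative.

```r
x <- c(340, 300, 520, 340, 320, 290, 260, 330)

# Quartiles using the (n + 1) position rule
q <- unname(quantile(x, c(0.25, 0.5, 0.75), type = 6))
five_num <- c(Min = min(x), Q1 = q[1], Median = q[2], Q3 = q[3], Max = max(x))
five_num

# Fences for outlier detection
iqr         <- q[3] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers    <- x[x < lower_fence | x > upper_fence]
outliers

# boxplot() computes its box from Tukey's hinges, so its quartiles can
# differ slightly from the values above
boxplot(x, horizontal = TRUE)
```

Here Q1 = 292.5, the median is 325, and Q3 = 340, giving fences at 221.25 and 411.25; the value 520 lies above the upper fence and is drawn as a separate point.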
Moments
Given a set of observations yi of a variable Y, the rth sample
moment about zero is defined as:
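The formula did not survive on this slide. The standard definition, written here with n observations (the symbol n is an assumption; the slide may have used a different one), is:

```latex
m'_r = \frac{1}{n} \sum_{i=1}^{n} y_i^{\,r}
```

For r = 1 this is the sample mean.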
Moments
The rth sample moment about the mean is defined as:
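The formula did not survive on this slide. The standard definition, with n observations and sample mean \bar{y} (both symbols are assumptions about the original notation), is:

```latex
m_r = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^{r},
\qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
```

For r = 2 this is the variance computed with an n denominator rather than n − 1.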
Skewness
Kurtosis
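The formulas for these two slides were not recoverable. One common moment-based convention, assumed here, defines skewness as m3 / m2^(3/2) and kurtosis as m4 / m2^2, where m_r is the rth sample moment about the mean. Other conventions (for example, excess kurtosis, which subtracts 3) also exist. The helper `central_moment()` is hypothetical, and the data are reused from the z-score example.

```r
# rth sample moment about the mean (n denominator)
central_moment <- function(y, r) {
  mean((y - mean(y))^r)
}

y <- c(1, 1, 0, 15, 2, 3, 4, 0, 1, 3)

m2 <- central_moment(y, 2)
m3 <- central_moment(y, 3)
m4 <- central_moment(y, 4)

skewness <- m3 / m2^(3 / 2)   # > 0 indicates a longer right tail
kurtosis <- m4 / m2^2         # equals 3 for a normal distribution

c(skewness = skewness, kurtosis = kurtosis)
```

For this data set the skewness is positive, reflecting the long right tail produced by the value 15.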
Introduction to statistical analysis
with R
What is R?
• Software environment for statistical computing and data
analysis
• R is a GNU package, and its source code is freely available.
• Pre-compiled binary versions are provided for various
operating systems.
• R has a command-line interface, but many graphical user
interfaces are also available.
• R can produce publication-quality graphs with
mathematical symbols.
• R is an interpreted language.
Applications of R
• Mainly used by statisticians and other practitioners
requiring an environment for statistical computation and
software development.
• R supports matrix arithmetic and can also operate as a
general matrix calculation toolbox, with performance
benchmarks comparable to GNU Octave or MATLAB.
• R can be used to perform high-performance statistical
computation required for statistical analysis of Big Data.
• R is also being used in Business Analytics.
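As a brief illustration of the matrix arithmetic mentioned above, the following sketch uses base R only. The matrix `A` and vector `b` are made-up values for demonstration.

```r
# A small symmetric matrix and a right-hand-side vector (illustrative values)
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
b <- c(1, 2)

A %*% A                 # matrix product
t(A)                    # transpose
sol <- solve(A, b)      # solves A %*% sol = b
sol
```

Note that `*` multiplies element by element, while `%*%` performs matrix multiplication.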
Getting R - 1
• R is an open-source programming language. Because of its
popularity, pre-compiled R binaries are also available for
different platforms.
• Binaries for Windows, Unix, or macOS can be
downloaded from the R Project website,
https://www.r-project.org.
• These binaries can be used directly to install R
on a computer.
Getting R - 2
• However, R's command-line interface may not be
suitable for beginners.
• For this reason, many graphical user interfaces
(GUIs) are available for R.
• These GUI-based tools provide a user-friendly
interface to write, correct, and run R
code.
• RStudio is one such widely used GUI
for R.
Getting R - 3
• RStudio
[Screenshot: RStudio window, with panes labelled "Workspace", "Command windows", and "Additional information"]
Creating vectors in R
Custom vector
Sequence
Repeat
Repeat of range
Repeat of sequence
All variables are vectors: even a single number is a vector of length 1. Variable names are case-sensitive.
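The commands shown on this slide were not preserved. The sketch below gives one plausible base-R command for each label; the specific values and variable names are illustrative.

```r
v  <- c(2, 5, 9)                          # custom vector
s1 <- 1:5                                 # range: 1 2 3 4 5
s2 <- seq(0, 1, by = 0.25)                # sequence with a step size
r1 <- rep(7, times = 3)                   # repeat a value: 7 7 7
r2 <- rep(1:3, times = 2)                 # repeat of range: 1 2 3 1 2 3
r3 <- rep(seq(0, 10, by = 5), each = 2)   # repeat of sequence: 0 0 5 5 10 10

# A single number is a vector of length 1
single <- 3
is.vector(single)
length(single)

# Names are case-sensitive: total and Total are different variables
total <- 10
Total <- 20
total == Total
```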
Basic operations on vectors - 1
Scalar addition
Scalar subtraction
Scalar multiplication
Scalar division
Element-wise sum
Element-wise multiplication
Element-wise square
Element-wise exponential
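The commands for these operations were not preserved on the slide. A minimal sketch in base R, with illustrative vectors `a` and `b`:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

a_plus   <- a + 2     # scalar addition:       3 4 5
a_minus  <- a - 2     # scalar subtraction:   -1 0 1
a_times  <- a * 2     # scalar multiplication: 2 4 6
a_div    <- a / 2     # scalar division:       0.5 1.0 1.5

ab_sum   <- a + b     # element-wise sum:            5 7 9
ab_prod  <- a * b     # element-wise multiplication: 4 10 18
a_square <- a^2       # element-wise square:         1 4 9
a_exp    <- exp(a)    # element-wise exponential
```

Each operation is applied to every element, so no explicit loop is needed.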
Importing data to R - 1
CSV: comma-separated values
.xlsx format
.csv format
Importing CSV data to R
Workspace
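The import commands on this slide were not preserved. The sketch below shows one way to read a CSV file with base R's `read.csv()`. To keep the example self-contained it first writes a small, made-up data frame to a temporary file; in practice the path would point to a file exported from a spreadsheet. Base R has no built-in reader for .xlsx files, so saving the sheet as .csv first is a simple route (packages such as readxl are another option).

```r
# Made-up data, written to a temporary CSV file for demonstration
patients <- data.frame(
  id     = 1:4,
  group  = c("control", "treated", "treated", "control"),
  weight = c(61.5, 70.2, 58.9, 66.0)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(patients, csv_path, row.names = FALSE)

# Import the CSV file into a data frame
dat <- read.csv(csv_path, header = TRUE)

str(dat)    # column names and types
head(dat)   # first few rows
```

After the call, `dat` appears in the workspace as a data frame whose columns can be accessed with `dat$weight`, `dat$group`, and so on.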
Calculating descriptive statistics in R - 1
Finding frequency in categorical data
Mean and median
Minimum and maximum
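The commands for this slide were not preserved. A sketch with base R functions follows; the vectors `group` and `weight` contain made-up values for illustration.

```r
group  <- c("control", "treated", "treated", "control", "treated")
weight <- c(61.5, 70.2, 58.9, 66.0, 64.4)

# Frequency of each category
freq <- table(group)
freq

# Mean and median
mean(weight)
median(weight)

# Minimum and maximum
min(weight)
max(weight)
range(weight)   # both at once
```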
Calculating descriptive statistics in R - 2
Variance and standard deviation
Alternatively
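The commands for this slide were not preserved, so it is not known which alternative was shown. The sketch below uses `var()` and `sd()` and, as one possible alternative, computes the same quantities directly from the definition. The `weight` values are made up.

```r
weight <- c(61.5, 70.2, 58.9, 66.0, 64.4)

# Built-in functions (both use the n - 1 denominator)
v <- var(weight)
s <- sd(weight)

# Alternatively, from the definition
n        <- length(weight)
v_manual <- sum((weight - mean(weight))^2) / (n - 1)
s_manual <- sqrt(v_manual)

c(variance = v, manual_variance = v_manual)
c(sd = s, manual_sd = s_manual)
```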
Plotting in R - 1
Plotting in R - 2
Data Plotting barplots
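The plotting commands on these slides were not preserved. The sketch below shows a bar plot of category counts and a basic scatter plot using base R graphics; the data, titles, and axis labels are made up for illustration.

```r
# Bar plot of frequencies in categorical data
group  <- c("A", "B", "B", "C", "A", "B")
counts <- table(group)
mids <- barplot(counts,
                main = "Samples per group",
                xlab = "Group",
                ylab = "Frequency")

# Scatter plot of two numeric variables
height <- c(150, 160, 165, 172, 180)
weight <- c(52, 58, 63, 70, 79)
plot(height, weight,
     main = "Weight vs. height",
     xlab = "Height (cm)",
     ylab = "Weight (kg)",
     pch  = 19)
```

`barplot()` returns the x-positions of the bar midpoints, which is useful when adding labels above the bars with `text()`.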