Statistical Modeling
STATISTICAL MODELLING
Data Analytics 1
GyanData Private Limited
Random Phenomena
Deterministic phenomenon: Phenomenon whose outcome can be
predicted with a very high degree of confidence
◦ Example: Age of a person (using date of birth stated in Aadhaar card)
Stochastic phenomenon: Phenomenon which can have many possible
outcomes for the same experimental conditions. The outcome can be predicted
only with limited confidence
◦ Example: Outcome of a coin toss
Probability Measure
A probability measure is a function that assigns a real value to every
event (set of outcomes) of a random phenomenon and satisfies the following axioms
P(S) = 1, where S is the sample space (one of the outcomes must occur)
0 ≤ P(A) ≤ 1 (the probability of any event A is non-negative and at most 1)
For two mutually exclusive events A and B
P(A ∪ B) = P(A) + P(B)
Conditional Probability
If two events A and B are not independent, then information
available about the outcome of event A can influence the
predictability of event B
Conditional probability
◦ P(B | A) = P(A ∩ B)/P(A) if P(A) > 0
◦ P(A | B)P(B) = P(B | A)P(A) (Bayes formula)
◦ P(A) = P(A | B)P(B) + P(A | B^c)P(B^c), where B^c is the complement of B
Example: two (fair) coin toss experiment with sample space {HH, HT, TH, TT}
◦ Event A : First toss is head = {HT, HH}
◦ Event B : Two successive heads = {HH}
◦ Pr(B) = 0.25 (no information)
◦ Given event A has occurred Pr(B | A) = P(A ∩ B)/P(A) = 0.25/0.5 = 0.5
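The coin toss example can be checked by enumerating the sample space. This is a minimal sketch, assuming equally likely outcomes; the helper name prob is made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space of two fair coin tosses: HH, HT, TH, TT (equally likely).
sample_space = ["".join(t) for t in product("HT", repeat=2)]

def prob(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

A = {s for s in sample_space if s[0] == "H"}  # first toss is a head
B = {s for s in sample_space if s == "HH"}    # two successive heads

p_B = prob(B)                         # no information about A
p_B_given_A = prob(A & B) / prob(A)   # conditional probability P(B | A)
print(p_B, p_B_given_A)  # 1/4 1/2
```

Knowing that the first toss is a head doubles the probability of two successive heads, from 1/4 to 1/2.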
Example
In a manufacturing process 1000 parts are produced of
which 50 are defective. We randomly take a part from
the day’s production
◦ Outcomes : {A = Defective part, B = Non-defective part}
◦ P(A) = 50/1000, P(B) = 950/1000
Suppose we draw a second part without replacing the
first part
◦ Outcomes : {C = Defective part, D = Non-defective part}
◦ P(C | A) = 49/999
◦ P(C) = P(C | A)P(A) + P(C | B)P(B) = (49/999)(50/1000) + (50/999)(950/1000) = 50/1000
◦ P(A | C) = P(A ∩ C)/P(C) = P(C | A)P(A)/P(C) = 49/999
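The arithmetic in this example can be reproduced with exact fractions. This is a sketch only; the variable names are chosen here for illustration.

```python
from fractions import Fraction

total, defective = 1000, 50

p_A = Fraction(defective, total)   # first part is defective
p_B = 1 - p_A                      # first part is non-defective

# Second draw without replacement: one part fewer in the lot.
p_C_given_A = Fraction(defective - 1, total - 1)
p_C_given_B = Fraction(defective, total - 1)

# Total probability, then Bayes formula.
p_C = p_C_given_A * p_A + p_C_given_B * p_B
p_A_given_C = p_C_given_A * p_A / p_C
print(p_C, p_A_given_C)  # 1/20 49/999
```

The unconditional probability of a defective second part equals that of the first part (1/20 = 50/1000), even though the draws are dependent.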
Random Variable
A random variable (RV) is a map from the sample space to the real line such
that there is a unique real number corresponding to every outcome of
the sample space
◦ E.g. coin toss sample space {H, T} mapped to {0, 1}. If the sample space outcomes are real
valued there is no need for this mapping (e.g. throw of a die)
◦ Allows numerical computations such as finding the expected value of an RV
◦ Discrete RV (throw of a die or coin)
◦ Continuous RV (sensor readings, time interval between failures)
◦ A probability measure is also associated with the RV
Chi-square distribution
◦ Density is characterized by parameter n (degrees of freedom)
◦ Distribution of the sum of squares of n independent standard normal RVs
◦ Distribution of the (suitably scaled) sample variance of normal data
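The sum-of-squares construction above (the chi-square distribution) can be explored by simulation. This is a rough sketch: the mean n and variance 2n used for comparison are standard properties of the chi-square distribution rather than something stated on the slide, and the seed, n and number of draws are arbitrary choices.

```python
import random
import statistics

def chi_square_draw(n, rng):
    """One draw: sum of squares of n independent standard normal values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

rng = random.Random(42)   # fixed seed so the run is repeatable
n = 5                     # degrees of freedom (arbitrary choice)
draws = [chi_square_draw(n, rng) for _ in range(20000)]

sample_mean = statistics.fmean(draws)     # should be close to n
sample_var = statistics.variance(draws)   # should be close to 2n
print(sample_mean, sample_var)
```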
Moments of a pdf
Similar to describing a function using derivatives, a
pdf can be described by its moments
◦ For continuous distributions
$E[x^k] = \int_{-\infty}^{\infty} x^k f(x)\,dx$
◦ For discrete distributions
$E[x^k] = \sum_{i=1}^{N} x_i^k\, p(x_i)$
Mean : $\mu = E[x]$
Variance : $\sigma^2 = E[(x - \mu)^2] = E[x^2] - \mu^2$
Standard deviation = Square root of variance = $\sigma$
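As a worked instance of the discrete formula, the moments of a fair six-sided die can be computed exactly. This is a sketch, assuming each face has probability 1/6; the helper name moment is made up for illustration.

```python
from fractions import Fraction

outcomes = range(1, 7)   # faces of a die
p = Fraction(1, 6)       # probability of each face (fair die assumed)

def moment(k):
    """k-th moment E[x^k] of the discrete distribution."""
    return sum(Fraction(x) ** k * p for x in outcomes)

mean = moment(1)
variance = moment(2) - mean ** 2                         # E[x^2] - mu^2
central = sum((x - mean) ** 2 * p for x in outcomes)     # E[(x - mu)^2]
print(mean, variance)  # 7/2 35/12
```

Both expressions for the variance give the same value, 35/12.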
Structure of Σ
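$$
\Sigma =
\begin{bmatrix}
\sigma_{x_1}^2 & \sigma_{x_1 x_2} & \cdots & \sigma_{x_1 x_n} \\
\sigma_{x_2 x_1} & \sigma_{x_2}^2 & \cdots & \sigma_{x_2 x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{x_n x_1} & \cdots & \cdots & \sigma_{x_n}^2
\end{bmatrix}
$$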
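A small sketch of how such a matrix can be estimated from data: variances sit on the diagonal, covariances off the diagonal, and the matrix is symmetric. The N − 1 divisor (sample estimate), the data values and the helper name covariance_matrix are assumptions made here for illustration.

```python
def covariance_matrix(columns):
    """Sample covariance matrix of several variables (one list per variable)."""
    n_obs = len(columns[0])
    means = [sum(col) / n_obs for col in columns]

    def cov(i, j):
        # Sample covariance of variables i and j with divisor N - 1.
        return sum(
            (a - means[i]) * (b - means[j])
            for a, b in zip(columns[i], columns[j])
        ) / (n_obs - 1)

    k = len(columns)
    return [[cov(i, j) for j in range(k)] for i in range(k)]

# Made-up observations of two variables x1 and x2.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 1.0, 4.0, 3.0]
sigma = covariance_matrix([x1, x2])
print(sigma)
```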
◦ $T^2_{n,N-n} = \frac{(N-1)\,n}{N-n}\, F_{n,N-n}$
SAMPLE STATISTICS
Basic Concepts
Population: Set of all possible outcomes of a
random experiment characterized by f(x)
Sample set (realization) : Finite set of observations
obtained through an experiment
Inference: Conclusion derived regarding the
population (pdf, parameters) from the sample set
◦ Inference made from a sample set is also uncertain since it
depends on the sample set which is one of many possible
realizations
Statistical Analysis
Descriptive Statistics (Analysis)
◦ Graphical : Organizing and presenting the data (eg. box
plots, probability plots)
◦ Numerical: Summarizing the sample set (eg. mean, mode,
range, variance, moments)
Inferential
◦ Estimation: Estimate parameters of the pdf along with its
confidence region
◦ Hypotheses testing: Making judgements about f(x) and its
parameters
Sample heights of 20 cherry trees: [55 55 59 60 63 65 66 67 67 67 71 71 72 73 75 75 78 81 82 83]
Measures of Spread
Represents spread of sample set
◦ Sample variance : $s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$
Unbiased estimate of population variance : $E[s^2] = \sigma^2$
Standard deviation is the square root of the variance
◦ Mean absolute deviation : $\bar{d} = \frac{1}{N} \sum_{i=1}^{N} |x_i - \bar{x}|$
◦ Range : $R = x_{max} - x_{min}$
◦ Eg. Sample heights of 20 cherry trees
s² = 70.5132 (212.25 with outlier)
s = 8.397 (population standard deviation used for generating the numbers was 10)
MAD = 6.85 (9.5 with outlier)
Range = 83 - 55 = 28
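The spread measures for the 20 cherry tree heights listed earlier in the deck can be recomputed directly from the definitions. This is a sketch using only the data without the outlier, since the outlier value itself is not given.

```python
heights = [55, 55, 59, 60, 63, 65, 66, 67, 67, 67,
           71, 71, 72, 73, 75, 75, 78, 81, 82, 83]

n = len(heights)
mean = sum(heights) / n

# Sample variance with divisor N - 1 (unbiased estimate).
s2 = sum((x - mean) ** 2 for x in heights) / (n - 1)

# Mean absolute deviation about the sample mean (divisor N).
mad = sum(abs(x - mean) for x in heights) / n

# Range: largest minus smallest observation.
data_range = max(heights) - min(heights)

print(round(s2, 4), round(mad, 2), data_range)  # 70.5132 6.85 28
```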
HYPOTHESES TESTING
Hypotheses Testing
A hypothesis is a statement or postulate about the
parameters of a distribution (or model)
◦ Null hypothesis H0 : The default or status quo postulate that
we wish to reject if the sample set provides sufficient
evidence (e.g. μ = 0)
◦ Alternative hypothesis H1 : The alternative postulate that is
accepted if the null hypothesis is rejected (e.g. μ < 0)
The hypothesis is generally converted to a test of the
mean or variance parameter of a population (or
differences in means or variances of populations)
Truth | Decision: H0 is not rejected | Decision: H0 is rejected
H0 is true | Correct decision, Pr = 1 - α | Type I error, Pr = α
H1 is true | Type II error, Pr = β | Correct decision, Pr = 1 - β
[Figure: distribution if null hypothesis is true and distribution if alternative hypothesis is true, with the decision threshold marked]
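The trade-off between the two error probabilities can be sketched numerically. Here α denotes the Type I error probability and β the Type II error probability. The two normal distributions (means 0 and 2, unit standard deviation) and the upper-tailed rejection rule are assumptions chosen for illustration, not values from the slides.

```python
from statistics import NormalDist

# Hypothetical distributions of the test statistic.
null_dist = NormalDist(mu=0.0, sigma=1.0)   # if the null hypothesis is true
alt_dist = NormalDist(mu=2.0, sigma=1.0)    # if the alternative hypothesis is true

# Choose the threshold so that the Type I error probability is 0.05.
alpha_target = 0.05
threshold = null_dist.inv_cdf(1 - alpha_target)

# Assumed rule: reject H0 when the statistic exceeds the threshold.
alpha = 1 - null_dist.cdf(threshold)   # Pr(reject H0 | H0 true)
beta = alt_dist.cdf(threshold)         # Pr(do not reject H0 | H1 true)
power = 1 - beta
print(threshold, alpha, beta, power)
```

Moving the threshold to the right lowers α but raises β, which is why both cannot be made small at the same time for a fixed sample.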
◦ Test statistic (assuming unknown but equal variances for two groups)
$f = \frac{s_1^2}{s_2^2} \sim F(N_1 - 1, N_2 - 1)$; f = 0.27
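A minimal sketch of the variance-ratio statistic follows. The two groups are made-up numbers (the data behind f = 0.27 is not given), and the p-value is approximated by simulating the ratio for normal samples with equal variances instead of using F tables; the function names are chosen here for illustration.

```python
import random

def sample_variance(values):
    """Sample variance with divisor N - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def f_statistic(group1, group2):
    """Return (f, df1, df2) with f = s1^2 / s2^2."""
    f = sample_variance(group1) / sample_variance(group2)
    return f, len(group1) - 1, len(group2) - 1

def simulated_p_value(f_obs, n1, n2, trials=5000, seed=0):
    """Approximate two-sided p-value under equal variances by simulation."""
    rng = random.Random(seed)
    below = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n1)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n2)]
        if sample_variance(a) / sample_variance(b) <= f_obs:
            below += 1
    tail = below / trials
    return 2 * min(tail, 1 - tail)

# Hypothetical measurements for two groups.
group1 = [4.9, 5.1, 5.0, 5.2, 4.8]
group2 = [4.0, 6.0, 5.5, 4.5, 5.0]

f, df1, df2 = f_statistic(group1, group2)
p_value = simulated_p_value(f, len(group1), len(group2))
print(f, df1, df2, p_value)
```

A small p-value suggests that the two population variances differ; with library support, the same p-value would normally be read from the F distribution with (N1 − 1, N2 − 1) degrees of freedom.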
chi-square test (p degrees of freedom) | Sum of squares of p independent standard normal variables | Test for variance | Test quality of regression model
F-test (p1 and p2 degrees of freedom) | Ratio of two chi-square variables, each divided by its degrees of freedom | Test for comparing variances of two groups | Choose between regression models having different number of parameters
End of session