statistics I
mm1: introduction and
distributions of sampling statistics
petar popovski
assistant professor
antennas, propagation and radio networking (APNET)
department of electronic systems
aalborg university
e-mail: [email protected]
lecture outline
introduction
descriptive statistics
– description and summarization of data sets
– chebyshev’s inequality and the weak law of large numbers
– normal data sets
– sample correlation coefficient
distributions of sampling statistics
– sample mean and variance
– central limit theorem
– sampling distribution from a normal population
introduction
statistics: the collection, description, and analysis of data, and
the drawing of inferences from it
– the term first appeared in 1770, in relation to the collection of
facts of interest to the state
– homer simpson on statistics: “oh, people can come up with
statistics to prove anything, Kent. 14% of people know that.”
descriptive and inferential statistics
– reasonable conclusions can be obtained by assuming a certain
probability model for the data
populations and samples
– a sample should be representative of the population
– why is a random sample good?
description of data sets (1)
frequency tables and graphs
relative frequency tables and graphs
description of data sets (2)
histograms
– bins, class intervals, left-end inclusion convention
– the histogram can be used to approximate a continuous
probability density function (pdf)
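The binning rule behind a histogram, including the left-end inclusion convention, can be sketched in a few lines of plain Python (the function name and data values are illustrative, not from the lecture):

```python
def histogram(data, left, width, nbins):
    """Counts per bin under the left-end inclusion convention:
    each bin [a, b) includes its left endpoint and excludes its right one."""
    counts = [0] * nbins
    for x in data:
        i = int((x - left) // width)   # index of the bin [left + i*width, left + (i+1)*width)
        if 0 <= i < nbins:
            counts[i] += 1
    return counts

data = [1.0, 1.5, 2.0, 2.0, 3.5, 4.9]              # illustrative values
print(histogram(data, left=1.0, width=1.0, nbins=4))  # [2, 2, 1, 1]
```

Note that 2.0 falls in the bin [2, 3), not [1, 2): that is exactly what the left-end inclusion convention decides.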
description of data sets (3)
ogive = cumulative frequency plot
– used to approximate the cumulative distribution function of the
underlying pdf
stem-and-leaf plot
– example: daily minimum temperatures (in °F)
sample mean, median and mode
sample mean
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
linear property
$\forall i,\; y_i = a x_i + b \;\Rightarrow\; \bar{y} = a\bar{x} + b$
calculation with frequencies (distinct values $v_i$ occurring $f_i$ times)
$\bar{x} = \sum_{i=1}^{k} \frac{v_i f_i}{n}$
– relation to the mean value of a random variable
sample median
– if n is odd, it is the (n+1)/2-th smallest value
– if n is even, it is the average of the values in positions n/2 and n/2+1
mean vs. median
– when are they expected to be the same?
sample mode = the most frequent value in the set
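The three location summaries can be sketched directly from their definitions (the data values below are illustrative, not from the lecture):

```python
def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_median(xs):
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:                           # odd n: the (n+1)/2-th smallest value
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2   # even n: average of positions n/2 and n/2+1

def sample_mode(xs):
    return max(set(xs), key=xs.count)        # most frequent value

data = [2, 3, 3, 5, 7]                       # illustrative data set
print(sample_mean(data))     # 4.0
print(sample_median(data))   # 3
print(sample_mode(data))     # 3
```

For this right-skewed data set the mean (4.0) exceeds the median (3); for symmetric data the two are expected to coincide.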
sample variance and standard deviation
variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
n-1 instead of n due to unbiased estimation
algebraic identity: $\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$
linear property: $\forall i,\; y_i = a + b x_i \;\Rightarrow\; s_y^2 = b^2 s_x^2$
example (with $n = 9$, $\sum_i x_i^2 = 203$, $\bar{x} = 35/9$):
$s^2 = \frac{203 - 9\,(35/9)^2}{8} = 8.361$
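The algebraic identity and the linear property can be checked numerically on any data set (the values and constants below are illustrative):

```python
import math

def sample_variance(xs):
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)   # n-1 denominator

xs = [3, 4, 5, 6, 7, 1, 2, 4, 3]     # illustrative data
n = len(xs)
xbar = sum(xs) / n

# algebraic identity: sum (xi - xbar)^2 == sum xi^2 - n * xbar^2
lhs = sum((x - xbar) ** 2 for x in xs)
rhs = sum(x * x for x in xs) - n * xbar ** 2
assert math.isclose(lhs, rhs)

# linear property: yi = a + b*xi  =>  s_y^2 == b^2 * s_x^2
a, b = 10.0, -2.0
ys = [a + b * x for x in xs]
assert math.isclose(sample_variance(ys), b ** 2 * sample_variance(xs))
print(round(sample_variance(xs), 3))
```

The identity is what makes the one-pass computation in the slide's example possible: only $\sum x_i^2$, $n$, and $\bar{x}$ are needed.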
sample percentiles
to determine the sample 100p percentile, where 0 ≤ p ≤ 1,
of a data set of size n, we need to find the value such that:
– at least np of the values are less than or equal to it
– at least n(1-p) of the values are greater than or equal to it
example
let the sample size be n=33. the sample 10th percentile is the 4th
smallest value, since ⌈33·0.1⌉ = 4
quartiles
– first (25%), second (50%), third (75%)
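The percentile rule can be sketched as follows, using the common convention of averaging the two neighbouring values when np is an integer (an assumption, since the two defining conditions alone admit a range of answers; the data are illustrative):

```python
import math

def sample_percentile(xs, p):
    """Sample 100p percentile: the ceil(n*p)-th smallest value, or the average
    of the (n*p)-th and (n*p + 1)-th smallest when n*p is an integer."""
    s = sorted(xs)
    n = len(s)
    k = max(math.ceil(n * p), 1)
    if k == n * p and k < n:           # n*p is an integer: average the neighbours
        return (s[k - 1] + s[k]) / 2
    return s[k - 1]

data = list(range(1, 34))              # n = 33 illustrative values
print(sample_percentile(data, 0.1))    # 4, the 4th smallest, since ceil(33*0.1) = 4
```

With this convention the first, second, and third quartiles are simply `sample_percentile(data, 0.25)`, `…(data, 0.5)`, and `…(data, 0.75)`.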
chebyshev’s inequality
for any value of $k \ge 1$, more than $100\left(1 - \frac{1}{k^2}\right)$ percent of
the data lie within the interval $(\bar{x} - ks,\ \bar{x} + ks)$
– it is universal, but the bound can therefore be loose
probability version of chebyshev's inequality
weak law of large numbers
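A quick empirical check of the data version of Chebyshev's inequality on a deliberately skewed sample (an exponential sample, chosen for illustration) shows how loose the universal bound is:

```python
import random
import statistics

random.seed(0)
data = [random.expovariate(1.0) for _ in range(1000)]   # skewed, illustrative sample
xbar = statistics.mean(data)
s = statistics.stdev(data)

for k in (1.5, 2, 3):
    frac = sum(1 for x in data if xbar - k * s < x < xbar + k * s) / len(data)
    bound = 1 - 1 / k ** 2
    print(f"k={k}: fraction inside {frac:.3f}, Chebyshev bound {bound:.3f}")
    assert frac > bound            # the inequality guarantees more than this fraction
```

For k = 2 the bound only promises 75% inside the interval, while the observed fraction is far higher; universality is paid for with looseness.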
normal and skewed data sets
[histograms: normal, approximately normal, and skewed-to-the-left data sets]
the empirical rule
for approximately normal data sets:
1. Approx. 68% of observations lie within $\bar{x} \pm s$
2. Approx. 95% of observations lie within $\bar{x} \pm 2s$
3. Approx. 99.7% of observations lie within $\bar{x} \pm 3s$
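The empirical rule can be verified on a simulated approximately normal data set (sample size and seed are illustrative):

```python
import random
import statistics

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]   # approximately normal sample
xbar = statistics.mean(data)
s = statistics.stdev(data)

for k, target in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    frac = sum(1 for x in data if abs(x - xbar) <= k * s) / len(data)
    print(f"within {k} sd: {frac:.3f} (empirical rule: {target})")
    assert abs(frac - target) < 0.02
```

Unlike Chebyshev's bound, these fractions are approximations specific to normal-looking data, not guarantees for arbitrary data sets.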
sample correlation coefficient
the statistical data can be given as pairs of values, and we want to
find whether there is a relation between those values
sample correlation coefficient
$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}
   = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{j=1}^{n}(y_j - \bar{y})^2}}$
measures association, not causation
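The coefficient in its sum form can be sketched directly (the paired data are illustrative):

```python
import math

def corr(xs, ys):
    """Sample correlation coefficient, written as the ratio of sums."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - xbar) ** 2 for x in xs)) * \
          math.sqrt(sum((y - ybar) ** 2 for y in ys))
    return num / den

xs = [1, 2, 3, 4, 5]                   # illustrative paired data
ys = [2, 4, 5, 4, 5]
print(round(corr(xs, ys), 3))          # 0.775

# r = ±1 exactly when the points lie on a straight line
assert math.isclose(corr(xs, [3 * x - 1 for x in xs]), 1.0)
assert math.isclose(corr(xs, [-2 * x for x in xs]), -1.0)
```

A value of r near ±1 indicates a strong linear association, but, as the slide stresses, it says nothing about causation.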
distribution of sampling statistics
sampling
if $X_1, X_2, \ldots, X_n$ are independent random variables having
a common distribution $F$, then they constitute a
sample (or random sample) from the distribution $F$.
types of inference problems
– parametric = F is known up to the values of some parameters
– non-parametric = nothing is assumed about the form of F
we now define a statistic as a random variable whose
value is determined by the sample data
– our goal is to examine the properties of this random variable
$Y = f(X_1, X_2, \ldots, X_n)$
sample mean
we suppose that the value of any population member
can be regarded as a random variable with
expectation (population mean) $\mu$ and variance (population variance) $\sigma^2$
sample mean for the sample of values $X_1, X_2, \ldots, X_n$:
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
$E[\bar{X}] = \mu \qquad \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$
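A simulation makes the two properties of the sample mean concrete: the average of many sample means is close to μ, and their variance is close to σ²/n (all parameters below are illustrative):

```python
import random
import statistics

random.seed(2)
mu, sigma, n, trials = 5.0, 2.0, 16, 20_000   # illustrative parameters

# draw many samples of size n and record each sample mean
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(trials)]

print(round(statistics.fmean(means), 3))       # close to mu = 5
print(round(statistics.variance(means), 3))    # close to sigma^2 / n = 0.25
assert abs(statistics.fmean(means) - mu) < 0.05
assert abs(statistics.variance(means) - sigma ** 2 / n) < 0.02
```

Quadrupling n would cut the variance of the sample mean by a factor of four, which is the practical payoff of averaging.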
central limit theorem (1)
a fundamental result in probability theory
– the theorem is very powerful, as the distribution of the variables
$X_i$ can have a general form; it is only required to have a finite
mean and variance
from this theorem it follows that, for large n, the variable
$Z = \frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}}$
is approximately a standard normal random variable, with density
$p_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$
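A simulation sketch of the theorem: standardized sums of uniform variables (which look nothing like a normal) behave like a standard normal (n, the seed, and the trial count are illustrative):

```python
import math
import random
import statistics

random.seed(3)
n, trials = 48, 20_000
mu, var = 0.5, 1 / 12                 # mean and variance of Uniform(0, 1)

# standardize each sum of n uniforms as in the theorem
zs = [(sum(random.random() for _ in range(n)) - n * mu) / math.sqrt(n * var)
      for _ in range(trials)]

# Z should behave like a standard normal: mean ~ 0, variance ~ 1,
# and about 95% of the mass within |z| < 1.96
frac = sum(1 for z in zs if abs(z) < 1.96) / trials
print(round(statistics.fmean(zs), 3), round(statistics.variance(zs), 3), round(frac, 3))
assert abs(statistics.fmean(zs)) < 0.03
assert abs(statistics.variance(zs) - 1) < 0.05
assert abs(frac - 0.95) < 0.01
```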
central limit theorem (2)
example
$E\!\left[\sum_{i=1}^{n} X_i - W\right] = 3n - 400$
$\mathrm{Var}\!\left(\sum_{i=1}^{n} X_i - W\right) = 0.09n + 1600$
$Z = \frac{\sum_{i=1}^{n} X_i - W - (3n - 400)}{\sqrt{0.09n + 1600}}$
$\frac{400 - 3n}{\sqrt{0.09n + 1600}} \le 1.28 \;\Rightarrow\; n \ge 117$
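The final step of the example can be checked numerically: the ratio $(400 - 3n)/\sqrt{0.09n + 1600}$ decreases in n, so a simple scan finds the smallest n satisfying the inequality:

```python
import math

def z_ratio(n):
    # left-hand side of the final inequality in the example
    return (400 - 3 * n) / math.sqrt(0.09 * n + 1600)

# the ratio is decreasing in n; find the smallest n with z_ratio(n) <= 1.28
n_min = next(n for n in range(1, 1000) if z_ratio(n) <= 1.28)
print(n_min)   # 117
assert n_min == 117
```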
central limit theorem (3)
an important application of the central limit theorem is
to binomial random variables
$X = X_1 + X_2 + \cdots + X_n, \qquad
X_i = \begin{cases} 1 & \text{with prob. } p \\ 0 & \text{with prob. } 1-p \end{cases}$
X is a random variable that represents the number of
successes in n trials, where the probability of success
in each trial is p
$E[X_i] = p; \quad \mathrm{Var}(X_i) = p(1-p)$
the central limit theorem states that
$Z = \frac{X - np}{\sqrt{np(1-p)}}$
is approximately a standard normal random variable
see problem 15 of chapter 6
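A sketch comparing an exact binomial probability with its normal approximation; the half-unit continuity correction is a standard refinement, and the parameters are illustrative:

```python
import math

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def std_normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, p, k = 100, 0.3, 35                 # illustrative numbers
z = (k + 0.5 - n * p) / math.sqrt(n * p * (1 - p))   # continuity correction
exact, approx = binom_cdf(k, n, p), std_normal_cdf(z)
print(round(exact, 4), round(approx, 4))
assert abs(exact - approx) < 0.01
```

For moderate n the approximation already agrees with the exact sum to within about a percentage point, while avoiding the large binomial coefficients.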
sample variance
the sample variance is a statistic defined as
$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$
by using (n-1) in the denominator we obtain
$E[S^2] = \sigma^2$
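The unbiasedness of S² can be illustrated by averaging it over many small samples (parameters and seed are illustrative; `statistics.variance` uses the n-1 denominator):

```python
import random
import statistics

random.seed(4)
sigma, n, trials = 2.0, 5, 50_000      # illustrative parameters

# sample variance (n-1 denominator) over many samples of size n
s2_values = [statistics.variance([random.gauss(0.0, sigma) for _ in range(n)])
             for _ in range(trials)]

print(round(statistics.fmean(s2_values), 3))   # close to sigma^2 = 4
assert abs(statistics.fmean(s2_values) - sigma ** 2) < 0.1
```

Dividing by n instead (as `statistics.pvariance` does) would systematically underestimate σ² by the factor (n-1)/n, which is noticeable for a sample as small as n = 5.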
sampling from normal population (1)
let $X_1, X_2, \ldots, X_n$ be a sample from a normal population,
$X_i \sim N(\mu, \sigma^2)$
then the sample mean is a normal random variable with
$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$
to find the distribution of the sample variance
$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$
recall the chi-square distribution
– $Y = Z_1^2 + Z_2^2 + \cdots + Z_n^2$ has a chi-square distribution with n degrees
of freedom if each $Z_i$ is standard normal
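The chi-square construction can be simulated directly from its definition as a sum of squared standard normals; a chi-square variable with n degrees of freedom has mean n and variance 2n (degrees of freedom and trial count below are illustrative):

```python
import random
import statistics

random.seed(5)
n, trials = 6, 40_000                  # illustrative degrees of freedom
ys = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(trials)]

# check the known moments: mean n, variance 2n
print(round(statistics.fmean(ys), 2), round(statistics.variance(ys), 2))
assert abs(statistics.fmean(ys) - n) < 0.1
assert abs(statistics.variance(ys) - 2 * n) < 0.5
```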
sampling from normal population (2)
the variable $\frac{(n-1)S^2}{\sigma^2}$ has a chi-square distribution
with n-1 degrees of freedom
recall the t-distribution with n degrees of freedom as the
distribution of
$T = \frac{Z}{\sqrt{\chi_n^2 / n}}$
then it follows that
$\sqrt{n}\,\frac{\bar{X} - \mu}{S}$
has a t-distribution with n-1 degrees of freedom
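A simulation of the statistic $\sqrt{n}(\bar{X}-\mu)/S$ for small n shows the hallmark of the t-distribution: symmetric around zero but with heavier tails than the standard normal, for which only 5% of the mass lies beyond |1.96| (sample size, seed, and trial count are illustrative):

```python
import math
import random
import statistics

random.seed(6)
mu, n, trials = 0.0, 5, 30_000         # small n makes the heavy tails visible

ts = []
for _ in range(trials):
    xs = [random.gauss(mu, 1.0) for _ in range(n)]
    # the statistic sqrt(n) * (sample mean - mu) / sample standard deviation
    ts.append(math.sqrt(n) * (statistics.fmean(xs) - mu) / statistics.stdev(xs))

frac = sum(1 for t in ts if abs(t) > 1.96) / trials
print(round(statistics.fmean(ts), 3), round(frac, 3))
assert abs(statistics.fmean(ts)) < 0.05   # symmetric around 0
assert frac > 0.07                        # clearly heavier tails than N(0,1)'s 5%
```

This excess tail mass is why small-sample confidence intervals use t-quantiles rather than normal quantiles.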
sampling from a finite population
random sample from a population of N elements
– each of the $\binom{N}{n}$ subsets is equally likely to be the sample
consider the case where a fraction p of the
population has some feature, i.e. Np elements in total
– let $X_i$ be the indicator variable, $X = X_1 + X_2 + \cdots + X_n$
note that now $X_1, X_2, \ldots, X_n$ are not independent
however, if $N \gg n$, then the distribution of X is
approximately that of a binomial r.v. with parameters n and p
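The exact distribution of X here is hypergeometric (sampling without replacement); a small numerical comparison, with illustrative numbers, shows how close it is to the binomial when N is much larger than n:

```python
import math

def hypergeom_pmf(k, N, K, n):
    """P(k feature elements in a sample of n drawn without replacement
    from N elements, K of which have the feature)."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

N, p, n = 100_000, 0.2, 10             # N >> n, illustrative numbers
K = int(N * p)                         # Np elements with the feature
max_diff = max(abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, n, p))
               for k in range(n + 1))
print(max_diff)                        # tiny: the two pmfs nearly coincide
assert max_diff < 1e-3
```

Intuitively, when N ≫ n the composition of the population barely changes as elements are removed, so sampling without replacement behaves like n independent trials with success probability p.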