
Lecture 1

Tuesday, 16 January 2024 2:58 pm

Data - collection of facts and figures (qualitative and quantitative)


Statistics is the science of variation
Descriptive stats
Intro to stats

Mean
Median
Mode

Visualisations
Graph - quantitative data - histogram, frequency curve, scatter
plot, stem and leaf
Chart - qualitative data - bar, pie

Predictive stats

A box plot can be used as a check of normality; the same goes for a histogram


Why do you need to check for normality? In statistical inference (SI), normally distributed data
is necessary for many inferential tests and procedures

Steps for predictive modelling (to build a predictive model)

Step 1: Objective definition/problem identification

H₀ (null hypothesis): μ = μ₀
H₁ (alternative hypothesis): μ < μ₀

Females complaining that avg salary of males is much higher than that of
females

Some additional steps for machine learning


Step 2: Collection of data (SI in broad perspective) - predictive analysis

Step 3: Data cleaning - formatting issues or missing values

Step 4: Training data (70% or 80%) and testing data (30% or 20%)
(major step in ML) - need training data file to create model then use
testing data to verify the accuracy of the model then deploying model

Step 5: Propose model

Step 6: Model learning

Step 7: Model testing

Step 8: Deploy model if it is accurate


If want to predict income of an individual - income is dependent variable
Factors (independent variables): education, skills, experience, industry,
gender, …

Randomly take 70% records and store in a file called training data. The
rest of the 30% go to a file called testing data
Propose a model according to data

Make scatter plots, if linear then use a linear model, if quadratic use that
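Step 4 above (randomly put 70% of records in a training file and 30% in a testing file) can be sketched in a few lines. This is a hypothetical Python illustration, not the course's own tooling; the record values are made up.

```python
import random

def train_test_split(records, train_frac=0.7, seed=42):
    """Randomly shuffle records, then split into training and testing sets."""
    shuffled = records[:]                 # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))                # 100 hypothetical records
train, test = train_test_split(records)
print(len(train), len(test))              # 70 30
```

The shuffle makes the split random, as the notes require; fixing the seed only makes the example reproducible.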

If an incorrect model is used, testing won't be accurate; restart from step 5 and
propose a different model

Linear model, quadratic model, exponential model

Regression model, classification model, logistic regression model - ML models

Supervised/unsupervised learning

COMPLETE
Lecture 2
Thursday, 18 January 2024 4:05 pm

Population, sample, census, sampling, parameter and statistic

Population vs sample: population is whole group (universal set) (complete set of all entities in a
particular boundary or timeframe), sample is a part of the group (subset)

Census and sampling are used for data collection

Parameter - calculating something for a population


Statistic - calculating something for a sample

Population of all account holders of HBL bank from the inception of the bank to today - space bound
by name of bank - doesn't require a census because the data is already digitally present in the system

Population, all videos on YT from Pak, bound by space


Subset, my videos on YT

Each video is an element, one member of universal set

Characteristics/Features/Variables of a video: video id, duration, date/time of upload,
subscriptions, view count

One variable - uni-variate


Multiple variables - multi-variate

Nominal, ordinal, scaled - measurement scales


Scaled variables (ratio and interval) are much more precise than nominal ones (because they use
numbers that support calculation) and are therefore superior (nominal is just categories)
Ordinal comes from order, or sorting - the categories have a ranking (superior and inferior)
Nominal - lowest predictive sense, least priority in data structures; values are labels only (they
can be numbers, but not numbers usable for calculation)

Height is quantitative, continuous but in excel sheet written in discrete form

Discrete - countable
Continuous - measurable; requires machinery or tools to measure

Strings can be ignored as far as predictive powers are concerned. Categorical variables still have
some sort of minimal predictive power. Ordinal are best in qualitative for predictive analytics.

Height of 50 students - population file


μ → population mean
x̄ → sample mean

σ² → population variance
s² → sample variance

P → population proportion
p → sample proportion

ρ → population correlation
r → sample correlation

In real life, the population is usually unknown - so we assume a sample unless it is stated that the data is the population

Descriptive stats (ITS)


1. Data collection
2. Classification
3. Analysis/descriptive
4. Visualisation
5. Probability distribution (b to e are continuous distributions; the rest are discrete)
a. Uniform
i. Binomial
ii. Poisson
iii. Geometric
iv. Hypergeometric
b. Normal
c. t-dist
d. F-dist
e. χ² dist

Inferential (predictive) - 3 main topics of SI


1. Estimation of parameter
2. Testing of hypothesis
3. Model building

H₀: μ = 75000
H₁: μ ≠ 75000

SPSS - R-language and R-Studio


SPSS works on clicks
R is command based

Download for next class: R and R studio from cran website

R is the core language; RStudio is an IDE (a wrapper) over R

The console is the actual R interpreter


The top left window is the script, all commands in one file and execute all at once or do individually
in bottom left window. Script writing is better

Inferential stats: looking at the sample to predict, estimate, or test the population parameter

The sample should be a random sample


Lecture 3
Tuesday, 23 January 2024 4:07 pm

Census: collecting data from entire population


Sampling: collecting data from random people in the population

Random sampling: every member of the population has a

chance to be included in the sample (the probability of selection may,
however, differ from person to person). In simple random sampling (SRS),
every member has an equal chance

Sampling with replacement or without replacement


Depends on situation, which one to use

All possible samples without replacement = C(N, n) (N choose n)

All possible samples with replacement = Nⁿ

For random sampling in excel:


RAND() -> 0 to 1 -> draws numbers from a uniform distribution
RANDBETWEEN(min, max) -> 1 to 50 in our case

μ = population mean = ΣX / N

σ² = population variance = MSD = Σ(X − μ)² / N

Take distance from mean


Square distance
Take mean of these

Parameters - the central point of the whole course: estimation of
population parameters
Mean - μ
Variance - σ²
Proportion - P
Correlation - ρ

Parameters are always constants

x̄ = Σx / n

s² = Σ(x − x̄)² / (n − 1)
Why do we use n-1 in sample variance?

Sample mean is an estimate of the population mean. Sample variance is an
unbiased estimator of the population variance if it is computed with n − 1 in the denominator

μ̂ = x̄

σ̂² = s²
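The n − 1 claim above can be checked by simulation: repeatedly draw small samples from a known population and average the two versions of the sample variance. This is a hypothetical Python sketch (the course uses R/Excel); the population is randomly generated.

```python
import random

random.seed(0)
population = [random.gauss(50, 10) for _ in range(1000)]
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)   # divide by N

def sample_var(xs, denom):
    """Sample variance with a chosen denominator (n or n - 1)."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / denom

n, reps = 5, 20000
avg_n  = sum(sample_var(random.sample(population, n), n)     for _ in range(reps)) / reps
avg_n1 = sum(sample_var(random.sample(population, n), n - 1) for _ in range(reps)) / reps

# Dividing by n systematically underestimates sigma^2; dividing by n - 1 averages
# much closer to the true population variance (unbiasedness).
print(round(sigma2, 1), round(avg_n, 1), round(avg_n1, 1))
```

Averaged over many samples, the n version lands around (n−1)/n of σ² while the n−1 version lands near σ² itself.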

STDEV.S - sample standard deviation

STDEV.P - population standard deviation

VAR.S - sample variance

VAR.P - population variance

Checking the normality of a dataset

How to check (graphical methods):


1. Histogram
2. Box plot
3. qq plot
4. Stem and leaf
5. Empirical theorem - Chebyshev theorem

Tests of normality:
1. Kolmogorov-Smirnov test
2. Shapiro-Wilk test

Estimation of parameter, testing of hypothesis and predictive modelling


require normality of data

Symmetric, skewness = 0

Skewness +ve

Skewness -ve

Review content before coming to next class


lect1: intro file
lect1: recorded video
Add Notes 1
Thursday, 25 January 2024 11:15 am

Chapter 1

The elements in a population have at least one factor of homogeneity, but with respect to other
attributes we can bifurcate the population into segments (so it can also have heterogeneous elements).
The number of elements in a population can be finite or infinite

Census: complete enumeration, collect data and analyse the whole population

Random sample - everyone has a chance of being selected, not necessarily equal probability for
everyone
Simple random sample - random sample with equal probability for everyone

Parameter - characteristic of population


Statistic - characteristic of sample

In stratified sampling, bifurcate population with respect to the heterogenous factor, select sample
from each segment using simple random and then combine them all for a final sample
Equal allocation vs proportional allocation (proportional to size of stratum)

In systematic sampling, divide into segments by size, nothing to do with characteristics

For example 1000 pages in a book, divide into 10 segments according to page number. Take the
first page number randomly, for eg 97. After that take page numbers systematically for example
197, 297, …, 997. First element is random, the rest are systematic in sequence. Segments always of
equal size

Proportion is mainly for qualitative characteristics but can also be used on quantitative - it is
basically probability

Quantitative techniques can be used on only quantitative discrete or numerical data but are more
accurate
Qualitative techniques can be used on both quantitative and qualitative data but are less accurate

Chapter 7

m = number of possible samples
m = Nⁿ (with replacement) or C(N, n) (without replacement)

Sampling distribution of the mean is a table comprising all values of the sample mean along
with their probabilities, such that the probabilities total 1. It is a probability
distribution of a statistic.

Sampling error is the difference between the actual and estimated value of a parameter, x̄ − μ.
It is reduced by increasing the sample size. Another method is to use the most appropriate sampling
method (e.g., simple random sampling for a homogeneous population).
Sampling error is the error resulting from using a sample to estimate a population characteristic
(parameter)

Standard error is the standard deviation of the sampling distribution of the mean. It is reduced by
increasing the sample size.

Standard error with replacement: σ_x̄ = σ/√n

Standard error without replacement: σ_x̄ = (σ/√n) · √((N − n)/(N − 1))

Finite population correction (fpc) factor:

The FPC value is neglected when the population size is very high. So if N is large, or if N is
small but sampling is done with replacement, the population is considered large and its
behaviour is treated as that of an infinite population

If n/N ≤ 0.05 → don't really need FPC
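The effect of the FPC can be seen numerically. Using the auditor example from these notes (N = 1000, n = 36, σ = 45) as illustrative numbers, a Python sketch comparing the two standard errors:

```python
import math

# Illustrative numbers (the auditor example in these notes): N = 1000, n = 36, sigma = 45
N, n, sigma = 1000, 36, 45

se_with_replacement = sigma / math.sqrt(n)        # sigma / sqrt(n)
fpc = math.sqrt((N - n) / (N - 1))                # finite population correction factor
se_without_replacement = se_with_replacement * fpc

print(n / N <= 0.05)                              # True -> FPC barely matters here
print(round(se_with_replacement, 4), round(se_without_replacement, 4))
```

With n/N = 0.036, the FPC shrinks the standard error only from 7.5 to about 7.37, which is why the ≤ 0.05 rule lets us ignore it.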

Using the table:
mean of sam. dist of mean = μ_x̄ = E(x̄) = Σ x̄·P(x̄)

σ²_x̄ = V(x̄) = Σ x̄²·P(x̄) − [Σ x̄·P(x̄)]²

Kurtosis and chebyshev review. Empirical rule


Lecture 4
Thursday, 25 January 2024 4:02 pm

Quiz 1 based on chapter 1 and 7

Random sample - each element has some probability, not necessarily

equal, of being selected in the sample (simple random gives each an equal chance)
Stratified sampling - distribute into categories using a characteristic,
select from each category using simple random - increases accuracy of
estimates

N₁ = no. of girls, N₂ = no. of boys

N = N₁ + N₂ = total number of students
N = 50 → population

n = 5, n₁ = 3, n₂ = 2

Equal allocation - total sample will be distributed equally among


segments
Proportional allocation - according to proportion of each segment
in population

Unweighted mean - to find the final mean of all segments when using
equal allocation:

x̄ = (x̄₁ + x̄₂ + x̄₃) / 3

Weighted mean - to find the final mean of all segments when using
proportional allocation:

x̄ = (n₁x̄₁ + n₂x̄₂ + n₃x̄₃) / (n₁ + n₂ + n₃)

Truncated mean - remove the extreme values (basically outliers) before averaging

Chapter 7 - sampling distribution of mean:

Raw data: just recorded observations


Frequency distribution: frequency table intervals, tally, frequencies

x f
150-200 5
200-250 9
250-300 3
⋮ ⋮
50

Probability distribution:

x P(x)
175 5/50
225 9/50
275 3/50
⋮ ⋮
50/50 = 1

Sampling distribution of mean: Probability distribution of sample means


Sampling with replacement means infinite samples are possible

Numerical example: for understanding and defining theorems

A. Draw all possible random samples of size 3 with replacement

from the following population:
Temperatures in °C: 1, 3, 5

N = pop size = 3
n = sample size = 3
By: with replacement

m = Nⁿ = 3³ = 27

Draw tree diagram for ease of creating combinations

S.No Sample Mean 𝑥̅


1 1, 1, 1 3/3
2 1, 1, 3 5/3
3 1, 1, 5 7/3
4 1, 3, 1 5/3
5 1, 3, 3 7/3
6 1, 3, 5 9/3
7 1, 5, 1 7/3
8 1, 5, 3 9/3
9 1, 5, 5 11/3
10 3, 1, 1 5/3
11 3, 1, 3 7/3
12 3, 1, 5 9/3
13 3, 3, 1 7/3
14 3, 3, 3 9/3
15 3, 3, 5 11/3
16 3, 5, 1 9/3
17 3, 5, 3 11/3
18 3, 5, 5 13/3
19 5, 1, 1 7/3
20 5, 1, 3 9/3
21 5, 1, 5 11/3
22 5, 3, 1 9/3
23 5, 3, 3 11/3
24 5, 3, 5 13/3
25 5, 5, 1 11/3
26 5, 5, 3 13/3
27 5, 5, 5 15/3

B. Find the mean of each sample - done above

C. Construct a sampling distribution of mean

x̄ f P(x̄) x̄·P(x̄) x̄²·P(x̄)


3/3 1 1/27 3/81 9/243
5/3 3 3/27 15/81 75/243
7/3 6 6/27 42/81 294/243
9/3 7 7/27 63/81 567/243
11/3 6 6/27 66/81 726/243
13/3 3 3/27 39/81 507/243
15/3 1 1/27 15/81 225/243
total 27 27/27 = 1 243/81 = 3 2403/243

D. Find population mean and variance

μ = (1 + 3 + 5)/3 = 3

σ² = [(1 − 3)² + (3 − 3)² + (5 − 3)²]/3 = 8/3

E. Find mean and variance of sampling distribution of mean

E(x̄) → expected value

mean of sam. dist of mean = μ_x̄ = E(x̄) = Σ x̄·P(x̄) = 243/81 = 3

σ²_x̄ = V(x̄) = Σ x̄²·P(x̄) − [Σ x̄·P(x̄)]² = 2403/243 − 3² = 8/9
F. Compare the population mean with the mean of the sampling
distribution of mean

μ = μ_x̄

Mean of sampling distribution of all sample means is always equal


to population mean - very imp theorem
With or without replacement

G. Compare the population variance with the variance of the


sampling distribution of mean, and make conclusions

σ²_x̄ = σ²/n

σ_x̄ = σ/√n

The variance of the sampling distribution of sample means is always

equal to the population variance divided by the sample size
Only for with replacement
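The two theorems above can be verified exactly for this example by enumerating all 27 samples of size 3 drawn with replacement from {1, 3, 5}. A small Python sketch (the notes do this by hand; exact fractions avoid rounding):

```python
from itertools import product
from fractions import Fraction

population = [1, 3, 5]
n = 3

samples = list(product(population, repeat=n))     # all 27 samples with replacement
means = [Fraction(sum(s), n) for s in samples]

mean_of_means = sum(means) / len(means)
var_of_means = sum(m**2 for m in means) / len(means) - mean_of_means**2

mu = Fraction(sum(population), len(population))
sigma2 = sum((Fraction(x) - mu)**2 for x in population) / len(population)

print(len(samples))                   # 27
print(mean_of_means == mu)            # True: mean of sampling dist = population mean
print(var_of_means == sigma2 / n)     # True: variance = sigma^2 / n = 8/9
```

This reproduces the table's totals: μ_x̄ = 3 and σ²_x̄ = 8/3 ÷ 3 = 8/9.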

H. Determine the shape of the sampling distribution of means - slide
32 has all the important conclusions from the chapter
a. The shape of the sampling distribution of mean is normal if
population is normal
b. The shape of the sampling distribution of means is
approximately normal if the sample size is large (n > 25)
and the population is not normal - Central Limit Theorem
c. We can't determine the shape of the sampling distribution of
means if the sample size is small and the population is not
normal - no conclusion

x̄ → represents the sampling distribution of sample means

σ_x̄ → standard error of the sample mean → sd of the sampling distribution


Lecture 5
Thursday, 1 February 2024 4:00 pm

Probability distribution (p.d): a table containing two columns x and P(x)

Sampling distribution (s.d): It is the p.d of 𝑥̅ . Probability distribution of statistic 𝑥̅ . Two columns
𝑥̅ and 𝑃(𝑥̅ )

i. μ_x̄ = μ
ii. σ_x̄ = σ/√n
iii. If X ~ N (X follows a normal distribution) then x̄ does too
iv. If X ~ any (X follows any dist.) then x̄ ≈ N (approximately normal) provided n ≥ 25 →
central limit theorem

z = (x − μ)/σ → regular z-score

For CLT:

z = (x̄ − μ_x̄)/σ_x̄ = (x̄ − μ)/(σ/√n)

CLT questions from handout:

Qs 2: Let’s say that 80% of all business startups in the IT industry report that they generate a
profit in their first year. If a sample of 10 new IT business startups is selected, find the probability
that exactly seven will generate a profit in their first year. Find the probability that at least 4 will
generate a profit in their first year. Find the probability that between 3 to 5 startups will be
successful

𝑥 → # of startups which are successful out of 10


𝑛 = 10, 𝑝 = 0.8

If 𝑛 < 20 use binomial. Otherwise it gets approximated to Poisson or Normal

P(x = 7):

C(10, 7) × (0.8)⁷ × (0.2)³ = 0.201

In R: dbinom(7, 10, 0.8)

𝑃(𝑥 ≥ 4):

= 1 − 𝑃(𝑥 ≤ 3) = 1 − [𝑃(𝑥 = 0) + 𝑃(𝑥 = 1) + 𝑃(𝑥 = 2) + 𝑃(𝑥 = 3)]


= 1 − 0.001 = 0.999
In R: 1 - pbinom(3, 10, 0.8)

𝑃(3 ≤ 𝑥 ≤ 5):

= 𝑃(𝑥 = 3) + 𝑃(𝑥 = 4) + 𝑃(𝑥 = 5)


= 𝑃(𝑥 ≤ 5) − 𝑃(𝑥 ≤ 2)

P(X = x) = C(n, x) · pˣ · (1 − p)ⁿ⁻ˣ

4 methods for binomial: manual, table, pbinom and dbinom

Use the cumulative table unless p does not appear in the table; in that case, work it out manually

Qs 3: An auditor takes a random sample of size 36 from a population of size 1,000 accounts
receivable. The mean value of the accounts receivable for the population is $260.00, with the
population standard deviation $45.00.
(using with replacement) - Related to sampling distribution of mean
(a) What is the probability that the sample mean will be less than $250.00?
(b) What is the probability that the sample mean will be within $15.00 of the population mean?

We don't know the distribution of X but n=36>25 so approximately normal using CLT
𝜇 = $260, 𝜎 = $45

(a) 𝑃(𝑥̅ < 250)

Because population is unknown but the sample size is large, therefore we can use
normal distribution to find out this probability
z = (x̄ − μ_x̄)/σ_x̄ = (x̄ − μ)/(σ/√n)

P( (x̄ − μ)/(σ/√n) < (250 − 260)/(45/√36) ) = P(z < −1.33)

From table: 0.0918

In R: pnorm(-1.33, 0, 1) → arguments are z, mean, sd; with the defaults, just pnorm(-1.33)

(b) P(μ − 15 < x̄ < μ + 15)

= P( (μ − 15 − μ)/(σ/√n) < (x̄ − μ)/(σ/√n) < (μ + 15 − μ)/(σ/√n) )

= P( −15/(45/√36) < z < 15/(45/√36) ) = P(−2 < z < 2)

= P(z < 2) − P(z < −2)

= 0.9772 − 0.0228 = 0.9544
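Both parts of Qs 3 can be checked without a table, since the standard normal CDF is expressible via the error function. A hypothetical Python sketch (the course uses pnorm in R; erf gives slightly more precision than the z-table, which rounds z to −1.33):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function (what pnorm computes)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 260, 45, 36
se = sigma / sqrt(n)                    # standard error = 45/6 = 7.5

p_a = phi((250 - mu) / se)              # P(xbar < 250), z = -10/7.5 = -1.33...
p_b = phi(2) - phi(-2)                  # P(mu - 15 < xbar < mu + 15), z = +/- 15/7.5 = +/- 2

print(round(p_a, 4), round(p_b, 4))
```

The results agree with the table answers 0.0918 and 0.9544 up to the table's rounding.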

After conversion to z, the mean is always 0 and standard deviation is 1

Proof that the variance of the z form is always 1:

z = (x − μ)/σ

var(z) = var( (x − μ)/σ )

= var(x/σ − μ/σ) = var(x/σ) − var(μ/σ)

= var(x/σ) − 0 = var(x)/σ² − 0

= σ²/σ² = 1

Qs 4: The mean selling price of senior condominiums in Green Valley over a year was $215,000.
The population standard deviation was $25,000. A random sample of 100 new unit sales was
obtained.
a. What is the probability that the sample mean selling price was more than $210,000?
b. What is the probability that the sample mean selling price was between $213,000 and $217,000?
c. What is the probability that the sample mean selling price was between $214,000 and $216,000

𝜇 = 215000, 𝜎 = 25000
𝑛 = 100 > 25

(a) P(x̄ > 210,000)

P( z > (210000 − 215000)/(25000/√100) ) = P(z > −2) = P(z < 2) = 0.9772

With replacement standard error:

σ_x̄ = σ/√n

Without replacement standard error:

σ_x̄ = (σ/√n) · √((N − n)/(N − 1))

The formula of z-score also differs accordingly


Lecture 6
Tuesday, 6 February 2024 12:00 pm

Missed lecture from last Tuesday

Sampling distribution of proportions

μ_p̂ = P

σ_p̂ = √( P(1 − P)/n )

p̂ is approximately normally distributed for large n - CLT of proportion

Z-score:

z = (x − μ)/σ → (p̂ − P)/√(P(1 − P)/n) → z-score for proportion

CLT sampling dist of mean and proportion file:

P = 0.6, n = 100

P(P − 0.05 < p̂ < P + 0.05) = P(0.55 < p̂ < 0.65)

The sample size is large, so p̂ is approximately normally distributed and we can therefore use the
z-score method to solve this

P( (0.55 − 0.6)/√(0.6×0.4/100) < (p̂ − P)/√(P(1 − P)/n) < (0.65 − 0.6)/√(0.6×0.4/100) ) = P(−1.021 < z < 1.021)

= 0.8461 − 0.1539 = 0.6922


P(p̂ − P > ?) = 0.1

P(p̂ − P > a) = 0.1

P(p̂ − P ≤ a) = 1 − 0.1 = 0.9

Can use CLT, so:

SE = σ_p̂ = √(0.25 × 0.75 / 120)

P( (p̂ − P)/σ_p̂ < a/σ_p̂ ) = 0.9

P(z < a/σ_p̂) = 0.9

at P = 0.9, z = 1.29

P(z < 1.29) = 0.9

a / √(0.25 × 0.75 / 120) = 1.29

a = 0.051
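The inverse lookup above (finding z for a given cumulative probability, which qnorm does in R) can be sketched in Python with a bisection on the normal CDF. A hypothetical illustration; phi_inv is a home-made stand-in for qnorm:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def phi_inv(p, lo=-10.0, hi=10.0):
    """Inverse normal CDF by bisection (a stand-in for R's qnorm)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

P, n = 0.25, 120
se = sqrt(P * (1 - P) / n)
a = phi_inv(0.9) * se                 # a = z_{0.9} * SE

print(round(phi_inv(0.9), 4))         # ~1.2816, matching qnorm(0.9) in R
print(round(a, 3))                    # ~0.051
```

Using the exact quantile 1.2816 instead of the table's 1.29 gives essentially the same a = 0.051.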

In R: qnorm helps find z


pnorm helps find cumulative probability
dnorm helps find distinct probability

qnorm(0.9, 0, 1) = 1.281552

Standard error is the standard deviation of sampling distribution of sample mean/proportion

σ_x̄ = σ/√n → with replacement

σ_x̄ = (σ/√n) · √((N − n)/(N − 1)) → without replacement

σ_p̂ = √(P(1 − P)/n) → with replacement

σ_p̂ = √(P(1 − P)/n) · √((N − n)/(N − 1)) → without replacement
Lecture 7 Double slot extra class

Sunday, 11 February 2024 10:00 am

Chapter 8: estimation of parameter

Before mids (statistical inference focused): need to complete course upto analysis of variance
(ch 14) - anova is included in mid. After mids, predictive analytics

Statistics: Descriptive | Inferential | Predictive

Inferential: Estimation of parameter | Testing of hypothesis

Inferential stats - going from the statistic to the parameter

Sample proportion is an estimate of population proportion

In estimation, the parameter is completely unknown and we estimate it using sample statistics.
If, however, party A says that the proportion of their supporters is 0.6 and party B wants to
challenge it, we have some pre-defined knowledge of the parameter that needs to be checked, so
this is hypothesis testing

Formulas and methods are almost same in both chapters

Estimation of parameter

Point estimation
𝜇̂ = 500 ℎ𝑜𝑢𝑟𝑠
Draw a sample and find mean
Sample mean is a point estimate of population mean

μ̂ = x̄
σ̂² = s²
P̂ = p
ρ̂ = r

x̄ − μ = sampling error
To reduce the error, increase the sample size and ensure random sampling

confidence interval/interval estimate


450 < μ < 550 hours

Confidence interval: an interval of numbers obtained from a point estimate of a parameter


Confidence level: the confidence we have that the true value of a parameter lies within the
confidence interval (95% is usually selected, in between 90 and 99%, think or given in
questions)

Example 8.1:
x̄ = 63.28 USD → sample mean
s² = 63.08 → sample variance
s = 7.94 USD → sample st. dev

μ̂ = 63.28 USD → point estimate of population mean

σ̂² = 63.08 → point estimate of population variance
σ̂ = 7.94 USD → point estimate of population st. dev
The hat symbol represents the fact that this is an estimate. Don't forget it please

z = (x − μ)/σ → normal population

z = (x̄ − μ)/(σ/√n) → z-score for sampling dist of sample mean

Make μ the subject:

x̄ − μ = ±z·(σ/√n)

μ = x̄ ± z·(σ/√n)

The one-mean z-interval procedure:

x̄ − z_{α/2}·σ/√n  to  x̄ + z_{α/2}·σ/√n

Assumptions:
1. Must be a simple random sample
2. The population must be normal or the sample size large (CLT)
3. σ known → unrealistic assumption in real life (this procedure is for a known population st. dev)

Figure out the QQ plot, Kolmogorov-Smirnov (KS) test and Shapiro test (we will do these in R).
The QQ plot should be a straight line sloping upward.

Example 8.3 maybe?

Population - all people in civilian labour force


Variable - age

All assumptions hold so can solve

Confidence interval 95%, 𝜎 = 12.1 years

The graphs show that the population is not normal, but because the sample size is large,
according to the central limit theorem, we can assume that the sampling distribution of sample
mean is approximately normal

𝑥̅ = 36.4 𝑦𝑒𝑎𝑟𝑠

1 − α = 0.95
α = 0.05
α/2 = 0.025 (probability for which we find z)

z_{0.025} = 1.96 (−ve on left, +ve on right)

33.0 < μ < 39.8

There is a 95% probability that the true mean of the population lies within the confidence
interval

Alternatively, if we take 100 different samples, the true mean will lie within about 95 of those
confidence intervals; about 5 intervals will be such that the true mean lies outside the confidence
interval.
Different type of question would be to determine sample size

z = (x̄ − μ)/(σ/√n)

z·(σ/√n) = x̄ − μ = E (margin of error or sampling error)

n = (z·σ/E)² → rounded up to the nearest whole number
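The sample-size formula above is a one-liner to apply. A hypothetical worked example in Python, borrowing σ = 12.1 from example 8.3 and assuming a margin of error E = 2 years at 95% confidence (these target numbers are illustrative, not from the notes):

```python
import math

# Illustrative inputs: 95% confidence (z = 1.96), sigma = 12.1 (example 8.3), margin E = 2
z, sigma, E = 1.96, 12.1, 2

n = (z * sigma / E) ** 2
n = math.ceil(n)          # always round UP to the nearest whole number

print(n)                  # 141
```

Rounding up matters: truncating 140.6 to 140 would give a margin slightly larger than E.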

The one-mean t-interval procedure:

Assumptions:
1. Must be simple random sample
2. The population must be normal or sample size large (CLT)
3. 𝜎 unknown → realistic assumption in real life

x̄ − t_{α/2}·s/√n  to  x̄ + t_{α/2}·s/√n

df = ν (nu) = n − 1 (degrees of freedom)

Read these slides so we can ask questions in next class

Population: all pickpocket offenses


Variable: losses in USD

The graphs show that the population is not normal, but because the sample size is large,
according to the central limit theorem, we can assume that the sampling distribution of sample
mean is approximately normal
Because the population st.dev is unknown, we are using t-test

𝑥̅ = 513.32 𝑈𝑆𝐷
𝑠 = 262.23 𝑈𝑆𝐷

α/2 = 0.025

df = ν = n − 1 = 24

t_{α/2; df} = t_{0.025; 24} = 2.064

When we use s instead of σ, an n − 1 appears (in s), and the df is the same n − 1

In the t-table, probabilities are outside (on top), t-scores are inside. In the z-table, however,
probabilities are inside and z-scores are outside

dt
pt
qt
rt

The q- functions give quantiles

qt(0.025, 24)
Chapter 9: hypothesis testing

The claim goes in the alternative hypothesis (H₁). The null hypothesis (H₀) is unbiased, neutral, etc.

Start by assuming null hypothesis is true, then prove whether or not alternate is true

Steps (critical value approach - slide 33):


1. State null and alternative hypothesis
2. Decide on significance level, 𝛼
3. Compute the value of the test-statistic (z-test or t-test)
4. Determine the critical values (critical region/rejection region from table) (left/right/two
tailed)
5. Decision - If the value of the test statistic falls in the critical region (that is not in
acceptance region), reject null hypothesis, otherwise do not reject null hypothesis
6. Conclusion

z = (x̄ − μ)/(σ/√n)

t = (x̄ − μ)/(s/√n)

Example 9.5

H₀: μ = 275 yards
H₁: μ < 275 yards

𝛼 = 0.05

𝑛 = 25
𝜎 = 20 𝑦𝑎𝑟𝑑𝑠 → 𝑧 − 𝑡𝑒𝑠𝑡

z = (264.4 − 275)/(20/√25) = −2.65

Critical region can be on the left, and on the right, or in between

Left tailed test

Right tailed test

Two tailed test

z_{0.05} = −1.64
If z < −1.64, reject the null hypothesis; otherwise accept it

Because the z-statistic, −2.65, is less than the critical value −1.64, we do not accept the null
hypothesis.

Conclusion: data provides sufficient evidence to conclude that mean driving distance of Jack is
less than 275 yards
Lecture 9
Tuesday, 13 February 2024 4:00 pm

Qs 1:
H₀: μ = 70 years
H₁: μ > 70 years

Hypothesis: a statement in which the parameter is stated

𝑥̅ = 71.8 𝑦𝑒𝑎𝑟𝑠
𝑛 = 100
𝜎 = 8.9 𝑦𝑒𝑎𝑟𝑠

𝛼 = 0.05

We use the z-test

z_obs = (x̄ − μ)/(σ/√n) = (71.8 − 70)/(8.9/√100) = 2.0225

z_obs for z-observed, or z_cal for z-calculated

Right-tailed test because H₁ has μ > μ₀

z_{0.05} = 1.645

Critical region: z ∈ [1.645, ∞)

z_crit for z-critical

Because z_obs is greater than z_crit, we do not accept the null hypothesis

Conclusion: The data provides sufficient evidence not to accept the null hypothesis and conclude
that the mean life span today is greater than 70 years

Qs 7:

H₀: μ = 35 mins
H₁: μ < 35 mins

𝑥̅ = 33.1 𝑚𝑖𝑛𝑠
𝑛 = 20
𝑠 = 4.3 𝑚𝑖𝑛𝑠
𝑑𝑓 = 19

Assumption: if the population sd is unknown, the sample sd can be used as a point estimate

𝛼 = 0.05
We use the t-test:

t_obs = (33.1 − 35)/(4.3/√20) = −1.976

t_{0.05; 19} = −1.729

t ∈ (−∞, −1.729] → critical interval

Because t_obs is lower than t_crit, we do not accept the null hypothesis.

Conclusion: the data provides sufficient evidence to not accept the null hypothesis and conclude
that the average time taken to complete the standardized test is less than 35 minutes

The strategy of hypothesis testing we've been doing so far is critical value approach

Now going to do p-value approach

Steps (p-value approach - slide 34):


1. State null and alternative hypothesis
2. Decide on significance level, α
3. Compute the value of the test-statistic (z-test or t-test)
4. Determine the p-value, P
5. If 𝑃 ≤ 𝛼, reject 𝐻 ; otherwise, do not reject 𝐻
6. Conclusion - interpret result

Repeating both questions using p-value approach

Qs 1:

z_obs = (x̄ − μ)/(σ/√n) = (71.8 − 70)/(8.9/√100) = 2.0225

Check in z-table

𝑃 = 1 − 0.9783 = 0.0217

0.0217 < 0.05 so do not accept null hypothesis

Qs 7:

t_obs = (33.1 − 35)/(4.3/√20) = −1.976

Check in t-table, restrict to specific df row


The p-value lies between 0.05 and 0.025, so take the average: P = 0.0375

0.0375 < 0.05 so do not accept null hypothesis

If two-tailed p-value test, take from one side and multiply by 2 for P value
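The p-value approach for Qs 1 can be reproduced without a table using the normal CDF. A hypothetical Python sketch (the erf-based value differs from the table's 0.0217 only because the table rounds z to 2.02):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

xbar, mu, sigma, n = 71.8, 70, 8.9, 100
z_obs = (xbar - mu) / (sigma / sqrt(n))   # 2.0225

p_value = 1 - phi(z_obs)                  # right-tailed test: P(Z > z_obs)
alpha = 0.05

print(round(z_obs, 4), round(p_value, 4), p_value <= alpha)
```

For a two-tailed test, the same code would double the one-sided tail probability, as the note above says.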
Lecture 10
Thursday, 15 February 2024 4:00 pm

Chapter 10:
Same 6 steps of hypothesis tests
Formula different

H₀: the average performance of employees in city A is the same as the average performance of
employees in city B: μ₁ = μ₂ → μ₁ − μ₂ = 0
H₁: the average performance of employees in city A is significantly different from the
average performance of employees in city B: μ₁ ≠ μ₂

𝛼 = 0.05

Hyderabad: x̄₁ = 750, σ₁ = 20
Karachi: x̄₂ = 780, σ₂ = 25

Basic z-score (the z-score is a variable, and so is x; μ and σ are constants):

z = (x − μ)/σ

In CLT z-scores:

z = (x̄ − μ)/(σ/√n) → CLT for x̄ with replacement

z = (x̄ − μ) / [ (σ/√n)·√((N − n)/(N − 1)) ] → CLT for x̄ without replacement

z = (p̂ − P)/√(P(1 − P)/n) → CLT for p̂ with replacement

z = (p̂ − P) / [ √(P(1 − P)/n)·√((N − n)/(N − 1)) ] → CLT for p̂ without replacement

Hypothesis testing with 2 means:

z = [ (x̄₁ − x̄₂) − (μ₁ − μ₂) ] / √(σ₁²/n₁ + σ₂²/n₂)
For the question:

z = [ (750 − 780) − 0 ] / √(20²/30 + 25²/30) = −5.13

𝑃 − 𝑣𝑎𝑙𝑢𝑒 ≈ 0
From R: 0.00000014487

The z-ordinate is called q in R (hence "qq plot")

Because the p-value is approximately 0, we reject the null hypothesis and conclude that it is not
true
P ≤ α, so reject H₀

Conclusion: the data provides sufficient evidence to not accept the null hypothesis and conclude
that the performance of employees is significantly better in Karachi as compared to Hyderabad
(due to higher sample mean)

if (x̄₁ − x̄₂) < 0 then conclude μ₁ < μ₂

Try with critical value test

This chapter is sampling distribution of difference of two means


1. 𝜇 ̅ ̅ = 𝜇 − 𝜇
𝜎 𝜎
2. 𝜎 ̅ ̅ = +
𝑛 𝑛

Qs from slide 8

μ₁ = 40, σ₁ = 12
μ₂ = 40, σ₂ = 6

X₁ ~ N, X₂ ~ N

a. n₁ = 9, n₂ = 4
μ_{x̄₁−x̄₂} = μ₁ − μ₂ = 0

σ_{x̄₁−x̄₂} = √(12²/9 + 6²/4) = √(16 + 9) = 5

Why is there a plus in SD?

(a − b)² = a² + b² − 2ab → the 2ab term corresponds to the covariance, which is 0 for independent samples


b. Yes, because both populations are normally distributed, the difference will also be normally
distributed

c. P(−10 < x̄₁ − x̄₂ < 10)

z = (−10 − 0)/√(12²/9 + 6²/4) = −2

z = (10 − 0)/√(12²/9 + 6²/4) = 2

P(z < 2) − P(z < −2)

= 0.9772 − 0.0228 = 0.9544

Type I and Type II error (table from slide 13)

H₀ was true but we rejected it → Type I error → because of the sample

H₀ was false but we accepted it → Type II error

α = P(Type I error) = P(reject H₀ | H₀ is true)

Example - P(punish an innocent man) due to the evidence

β = P(Type II error) = P(accept H₀ | H₀ is false)

Example - P(declare a guilty person innocent) due to the evidence

H₀: μ = 200 ml
H₁: μ ≠ 200 ml

Criteria for acceptance: 191 < 𝑥̅ < 209


Criteria for rejection: 𝑥̅ ≤ 191 𝑜𝑟 𝑥̅ ≥ 209

P(x̄ ≤ 191 or x̄ ≥ 209 | μ = 200)

= P(x̄ ≤ 191 | μ = 200) + P(x̄ ≥ 209 | μ = 200)

= P( z ≤ (191 − 200)/(15/√9) ) + P( z ≥ (209 − 200)/(15/√9) )

= P(z ≤ −1.8) + P(z ≥ 1.8)

= P(z ≤ −1.8) + 1 − P(z ≤ 1.8) = 2 × P(z ≤ −1.8)
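The notes stop at 2 × P(z ≤ −1.8); the resulting Type I error probability can be evaluated directly. A Python sketch of the calculation above (σ = 15 and n = 9 are the values implied by the z computation):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu0, sigma, n = 200, 15, 9
se = sigma / sqrt(n)                  # 15/3 = 5

# alpha = P(reject H0 | H0 true) = P(xbar <= 191) + P(xbar >= 209) when mu = 200
z = (191 - mu0) / se                  # -1.8; the other tail is symmetric at +1.8
alpha = 2 * phi(z)

print(round(alpha, 4))                # ~0.0719
```

So with this acceptance criterion the test has a significance level of about 7.2%.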
Lecture 11
Tuesday, 20 February 2024 4:02 pm

Boys are claiming that marks obtained by boys are significantly lower than those obtained
by girls

H₀: μ_b = μ_g
H₁: μ_b < μ_g

Draw flowchart to choose formula of testing

Independent vs dependent sample


The private and public institutions are operating independently and don't consult each
other for salaries, so independent sample

A slimming center claims that it reduces average weight significantly in a month. One
person will have two records, one in x₁ and one in x₂. Therefore dependent samples

You join an exam prep center for the GRE. You take one mock test at the time of entrance, so
those marks are recorded. Then, one month later, the same students take a test and marks are
recorded again. Also a dependent sample.

Quiz 1, quiz marks

If two separate entities, independent sample. If one entity recorded twice at different times,
dependent sample.

Dependent sample also called: paired observations, associated sample, repeated measure,
related sample

T4 - pooled t-test: whenever the pop sd are assumed to be equal


T5 - non-pooled t-test: whenever the pop sd are not assumed to be equal

Take ratio of sample standard deviations


if s₁/s₂ < 2 then pooled

if s₁/s₂ ≥ 2 → non-pooled

T3 and T5 are the same; just replace σ with s

For T4:
σ₁ = σ₂ (both estimated by s_p)

t = [ (x̄₁ − x̄₂) − (μ₁ − μ₂) ] / √(s_p²/n₁ + s_p²/n₂) = [ (x̄₁ − x̄₂) − (μ₁ − μ₂) ] / [ s_p·√(1/n₁ + 1/n₂) ]

For T6: we need to make a column of the differences between x₁ and x₂, then find the mean and sd of that column

s_p → the two groups have separate variances, so we take a weighted mean to get one

x̄ = (n₁x̄₁ + n₂x̄₂) / (n₁ + n₂)

s_p² = [ (n₁ − 1)s₁² + (n₂ − 1)s₂² ] / (n₁ + n₂ − 2)

s_p² in this question = [ 34 × 26.1 + 29 × 23.95 ] / (35 + 30 − 2) ≈ 25.2

t = [ (x̄₁ − x̄₂) − (μ₁ − μ₂) ] / [ s_p·√(1/n₁ + 1/n₂) ] = 2.39

df for pooled (T4) = n₁ + n₂ − 2 = 30 + 35 − 2 = 63

p-value × 2 = 0.019 < 0.05, so reject the null hypothesis

Example 10.6: ASSUMING NORMALITY PLEASE

H₀: μ₁ = μ₂
H₁: μ₁ < μ₂

𝛼 = 0.05

Two-mean test, independent samples, unknown population sd

s₁ = 84.7, s₂ = 38.2

s₁/s₂ = 84.7/38.2 = 2.22 > 2 → non-pooled

df for non-pooled (T5) = Δ = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ] = 17
This pooled vs non-pooled can also be checked using box plot. If one box is more than
double of another, use non-pooled - called heuristic

T5

t = [ (x̄₁ − x̄₂) − (μ₁ − μ₂) ] / √(s₁²/n₁ + s₂²/n₂) = [ (394.6 − 468.3) − 0 ] / √(84.7²/14 + 38.2²/6) = −2.681

p-value at df = 17 and t_obs = −2.681

Lies between t_{0.01} and t_{0.005}, so take the average of both:

(0.01 + 0.005)/2 = 0.0075

p-value = pt(−2.681, 17) = 0.0078 < 0.05, so reject H₀

Because the p-value is smaller than 𝛼, reject 𝐻

Conclusion: the data provides sufficient evidence to conclude that the average operative
time in dynamic system is significantly smaller than the average operative time in the static
system and the claim is valid

dt - density
pt - cumulative probability
qt - inverse function of pt (quantiles)
rt - random t variates
Lecture 12 Variance and
proportion testing
Thursday, 22 February 2024 4:04 pm
anova

T₃: two-mean z-test → σ₁, σ₂ known

T₄: pooled t-test → σ₁, σ₂ unknown, assumed equal (σ₁ = σ₂)
T₅: non-pooled t-test → σ₁, σ₂ unknown, not assumed equal (σ₁ ≠ σ₂)
T₆: paired t-test → σ₁, σ₂ unknown, dependent samples

Insert image here of diagram

2a. 𝑇 One sample t-test


2b. T₄ pooled t-test

H₀: μ₁ = μ₂
H₁: μ₁ < μ₂

sₚ² = ((n₁ − 1)s₁² + (n₂ − 1)s₂²)/(n₁ + n₂ − 2)

sₚ = √((69(1.3²) + 105(1.4²))/(70 + 106 − 2)) = 1.361 ≈ 1.36

t = ((x̄₁ − x̄₂) − (μ₁ − μ₂))/(sₚ√(1/n₁ + 1/n₂)) = (4.4 − 5.3)/(1.36√(1/70 + 1/106)) = −4.2968

df = n₁ + n₂ − 2 = 174

Critical region method:


Round df off to 200 in the table: −t₀.₀₅;₂₀₀ = −1.653

t = −4.2968 < −1.653 lies in the rejection region, so reject H₀

Data provide sufficient evidence that the population mean response for business managers is lower than that for economics faculty

P-value method:
Using R, pt(-4.2968, 174) gives p-value ≈ 0.00001478

P-value is less than 𝛼, reject 𝐻
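The pooled calculation above can be replayed in code (Python in place of R; note the notes use the rounded sₚ = 1.36 and get t = −4.2968, while the unrounded sₚ gives ≈ −4.293):

```python
import math

# Pooled (T4) t-test: business managers vs economics faculty
n1, x1, s1 = 70, 4.4, 1.3
n2, x2, s2 = 106, 5.3, 1.4

sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t = (x1 - x2) / (sp * math.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(round(sp, 3), round(t, 1), df)  # 1.361 -4.3 174
```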

5. T₆ dependent sample
Same car, same driver, the only difference is the additive

Using R:

𝑡 = 1.59
𝑑𝑓 = 10 − 1 = 9

Critical region method:


t₀.₀₂₅;₉ = 2.262

1.59 lies in acceptance region

𝑝 − 𝑣𝑎𝑙𝑢𝑒 = 0.1449 > 0.05 𝑠𝑜 𝑎𝑐𝑐𝑒𝑝𝑡 𝐻


R workshop
Friday, 23 February 2024 9:36 pm

GUI:
Top left corner - script file
Bottom left - console
Top right - environment, history of commands, other tabs also
Bottom right - plots section (all graphs printed here), other tabs also

Two parts:
Core R (core software) and libraries

Some functions are available in core R and the rest are in libraries (which is extended R)
Core R functions are directly available, the libraries have to be imported
ggplot2 is the library for graphs/visualisations

To install a package, go to Tools → Install Packages and type the package name; load it with library(name) before use

num vs int type in R?

Convert character data type to factor type if it is in categories

Can do forecasting, prediction in R

Homoscedasticity?
Quiz Prep
Monday, 26 February 2024 7:00 pm

Type I/Type II error

a. α → significance level, or Type I error probability

Type I error is when we reject H₀ but it's actually true

P(x ≥ 11) → reject H₀

= P(x = 11) + P(x = 12)
= ¹²C₁₁(0.7)¹¹(0.3)¹ + ¹²C₁₂(0.7)¹²(0.3)⁰
= 0.07118 + 0.01384 = 0.08502

b. β → Type II error probability

Type II error is when we accept H₀, but it's actually false

P(x < 11) = 1 − P(x ≥ 11), now with p = 0.9:

= 1 − [¹²C₁₁(0.9)¹¹(0.1)¹ + ¹²C₁₂(0.9)¹²(0.1)⁰]
= 1 − (0.37657 + 0.28243)
= 1 − 0.659 = 0.341
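These binomial tail sums can be verified directly (a sketch in Python; `binom_pmf` is my own helper, not from the lecture):

```python
from math import comb

def binom_pmf(n, k, p):
    """P(X = k) for a Binomial(n, p) random variable."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 12
# Type I error: P(x >= 11) when H0's p = 0.7 is true
alpha = sum(binom_pmf(n, k, 0.7) for k in (11, 12))
# Type II error: P(x < 11) when actually p = 0.9
beta = 1 - sum(binom_pmf(n, k, 0.9) for k in (11, 12))
print(round(alpha, 4), round(beta, 3))  # 0.085 0.341
```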

a. Type I error is when we reject H₀ but it's actually true, so we need the probability of the rejection region

P(x < 6 or x > 12) = P(x ≤ 5) + P(x ≥ 13) = 1 − P(6 ≤ x ≤ 12)

= 1 − [P(x = 6) + P(x = 7) + P(x = 8) + P(x = 9) + P(x = 10) + P(x = 11) + P(x = 12)]

= 1 − Σᵢ₌₆¹² ¹⁵Cᵢ(0.6)ⁱ(0.4)¹⁵⁻ⁱ

= 1 − 0.9391 = 0.0609

b. Type II error is when we accept H₀ but it's actually false, so we need the probability of the acceptance region

p = 0.5:

P(6 ≤ x ≤ 12) = Σᵢ₌₆¹² ¹⁵Cᵢ(0.5)ⁱ(0.5)¹⁵⁻ⁱ = 0.8454

p = 0.7:

P(6 ≤ x ≤ 12) = Σᵢ₌₆¹² ¹⁵Cᵢ(0.7)ⁱ(0.3)¹⁵⁻ⁱ = 0.8695
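The same acceptance-region sums, checked in code (Python sketch; `accept_prob` is my own name):

```python
from math import comb

def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def accept_prob(p, n=15):
    """P(6 <= x <= 12) for Binomial(n, p) - the acceptance region."""
    return sum(binom_pmf(n, k, p) for k in range(6, 13))

alpha   = 1 - accept_prob(0.6)  # Type I: reject when p really is 0.6
beta_05 = accept_prob(0.5)      # Type II when actually p = 0.5
beta_07 = accept_prob(0.7)      # Type II when actually p = 0.7
print(round(alpha, 4), round(beta_05, 4), round(beta_07, 4))  # 0.0609 0.8454 0.8695
```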

a. Type I error is when we reject H₀ but it's actually true. Need to find the probability of the rejection region

p = 0.6

220/400 = 0.55,  260/400 = 0.65

P(p̂ < 0.55) + P(p̂ > 0.65)

= P(z < (0.55 − 0.6)/√(0.6 × 0.4/400)) + [1 − P(z < (0.65 − 0.6)/√(0.6 × 0.4/400))]

= P(z < −2.04) + [1 − P(z < 2.04)] = 0.04123

b. Type II error is when we accept H₀ but it's actually false. Use the new p to find the probability of the acceptance region

P(0.55 < p̂ < 0.65), p = 0.48

= P((0.55 − 0.48)/√(0.48 × 0.52/400) < z < (0.65 − 0.48)/√(0.48 × 0.52/400))

= P(2.8022 < z < 6.8054)

= P(z < 6.8054) − P(z < 2.8022) ≈ 0.0025
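These normal-approximation probabilities can be checked with Python's standard normal in place of z-tables (the β line in the notes appears to have a slipped decimal; the calculation gives ≈ 0.0025):

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal
n = 400

def z_score(phat, p):
    """Standardise a sample proportion under true proportion p."""
    return (phat - p) / sqrt(p * (1 - p) / n)

# Type I: P(p-hat < 0.55) + P(p-hat > 0.65) under H0's p = 0.6
alpha = Z.cdf(z_score(0.55, 0.6)) + (1 - Z.cdf(z_score(0.65, 0.6)))
# Type II: P(0.55 < p-hat < 0.65) when actually p = 0.48
beta = Z.cdf(z_score(0.65, 0.48)) - Z.cdf(z_score(0.55, 0.48))
print(round(alpha, 4), round(beta, 4))  # 0.0412 0.0025
```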


a. Type I error is when we reject H₀ but it's actually true. Need to find the probability of the rejection region

P(x̄ < 191) + P(x̄ > 209)

= P(z < (191 − 200)/(15/√9)) + P(z > (209 − 200)/(15/√9))

= P(z < −1.8) + [1 − P(z < 1.8)]

= 0.07186 ≈ 0.072

b. Type II error is when we accept H₀ but it's actually false. Use the new μ to find the probability of the acceptance region

P(191 < x̄ < 209) → acceptance region, μ = 215

= P((191 − 215)/(15/√9) < z < (209 − 215)/(15/√9))

= P(−4.8 < z < −1.2)

= P(z < −1.2) − P(z < −4.8) = 0.115
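The same two error probabilities for the mean, again with `statistics.NormalDist` standing in for a z-table:

```python
from statistics import NormalDist

Z = NormalDist()
se = 15 / 9**0.5  # sigma / sqrt(n) = 15 / sqrt(9) = 5

# Type I: P(x-bar < 191) + P(x-bar > 209) under mu = 200
alpha = Z.cdf((191 - 200) / se) + (1 - Z.cdf((209 - 200) / se))
# Type II: P(191 < x-bar < 209) when actually mu = 215
beta = Z.cdf((209 - 215) / se) - Z.cdf((191 - 215) / se)
print(round(alpha, 3), round(beta, 3))  # 0.072 0.115
```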


Lecture 13
Tuesday, 27 February 2024 4:01 pm

Chapter 16: One way and Two way ANOVA

One way ANOVA:

In ch 9:  H₀: μ = 800 hrs → by z or t test
In ch 10: H₀: μ₁ = μ₂ → by z or t test
In ch 16: H₀: μ₁ = μ₂ = ⋯ = μₖ → by F test

When someone wants to compare more than 2 means


Name - Analysis of variance

Why does the name have "variance" but the procedure compares means? Because it uses an F-test, the formula for which is:

Bifurcate/decompose the variance into 2 components, one the variance from a known source and the other the variance from an unknown source

F = s²(between)/s²(within) = s²(known source)/s²(unknown source)

Suppose there are 3 teachers of SI

Compare the average marks by teacher A, teacher B and teacher C

     A    B    C
1
2
⋮
n

(marks of students 1 … n, one column per teacher)

Variance between the columns is a known source (each teacher has a different style etc so
differences are to be expected among marks)

Unknown source is variance within a column (unknown variation). If same teacher is teaching,
there is a difference within them teaching so we don't exactly know why

If we take A,B,C as three varieties of wheat, 10 seeds of each, same conditions etc.
Between the varieties, differences are expected cause quality differences so known source
Within for eg variety A, why isn't the output of all 10 plants same? That is unknown variation

Only right-tailed test

Example from book:

H₀: μ₁ = μ₂ = μ₃ = μ₄
H₁: at least two are unequal (not all are equal)

𝛼 = 0.05

F-test, F = s²(between)/s²(within)

      Northeast   Midwest   South   West
nᵢ    5           6         4       5
x̄ᵢ    13          14.5      10      9.2
sᵢ    1.87        2.58      2.58    1.92

Overall mean, weighted by the nᵢ (the simple average of the four means, 11.67, does not reproduce SSTR):

x̄ = (5(13) + 6(14.5) + 4(10) + 5(9.2))/20 = 238/20 = 11.9

SSTR (numerator of the known-source variance):

SSTR = sum of squares between treatments (regions in this case) = SS between samples

SSTR = n₁(x̄₁ − x̄)² + n₂(x̄₂ − x̄)² + ⋯ + nₖ(x̄ₖ − x̄)² = Σ nᵢ(x̄ᵢ − x̄)²

= 5(13 − 11.9)² + 6(14.5 − 11.9)² + 4(10 − 11.9)² + 5(9.2 − 11.9)²

= 97.5

SSE (numerator of the unknown-source variance):

SSE = SS for the unknown source = SS within samples

SSE = (n₁ − 1)s₁² + (n₂ − 1)s₂² + ⋯ + (nₖ − 1)sₖ² = Σ (nᵢ − 1)sᵢ²

= 4(1.87²) + 5(2.58²) + 3(2.58²) + 4(1.92²)

= 82.3

ANOVA table:

SV (source of variation)        SS      df                                MSS (mean SS)              f                          P-value
Treatment (b/w regions), SSTR   97.5    dfn = 4 − 1 = 3 (no. of regions − 1)   s²(B) = 97.5/3 = 32.5      f = 32.5/5.14375 = 6.32    less than 0.005
Error, SSE                      82.3    dfd = 19 − 3 = 16                 s²(W) = 82.3/16 = 5.14375
Total, SST                      179.8   20 − 1 = 19 (total observations − 1)
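The whole table can be rebuilt from the summary statistics (Python sketch; with the rounded sds in the notes, SSE comes to ≈ 82.0 rather than the book's 82.3, so F ≈ 6.3 instead of 6.32):

```python
# One-way ANOVA for the four-region energy example, from the
# per-region summary statistics (n_i, x-bar_i, s_i) in the notes
ns    = [5, 6, 4, 5]
means = [13, 14.5, 10, 9.2]
sds   = [1.87, 2.58, 2.58, 1.92]

grand = sum(n * m for n, m in zip(ns, means)) / sum(ns)   # weighted grand mean
sstr  = sum(n * (m - grand)**2 for n, m in zip(ns, means))  # between-samples SS
sse   = sum((n - 1) * s**2 for n, s in zip(ns, sds))        # within-samples SS

dfn, dfd = len(ns) - 1, sum(ns) - len(ns)  # k - 1 and n - k
f = (sstr / dfn) / (sse / dfd)             # MSTR / MSE
print(round(grand, 1), round(sstr, 1), dfn, dfd, round(f, 1))  # 11.9 97.5 3 16 6.3
```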

We have 2 df in f-stats
dfn - degree of freedom numerator
dfd - degree of freedom denominator

P-value < 𝛼 so reject null hypothesis.

Data provides sufficient evidence to conclude that a significant difference exists in last year's
mean energy consumption by households among the four U.S. regions

Null hypothesis - insignificant


Alternate hypothesis - significant

Means are significant - difference exists

Now to find where difference exists. Which region consumes more electricity and which
consumes less

If null hypothesis of ANOVA is rejected, then we have to perform pair-wise comparison


Lecture 14
Thursday, 29 February 2024 4:03 pm

Mid term upto one-way ANOVA (two proportion/one proportion not included)
MCQs, fill in the blanks, numerical

One-way ANOVA: (alternate def) decomposing total variation into two components, one is called
known sources and other is called unknown sources

Partitioning of sum of squares:


𝑆𝑆 𝑇𝑜𝑡𝑎𝑙 = 𝑆𝑆 𝑇𝑟 + 𝑆𝑆𝐸
= 𝑠𝑠 𝑘𝑛𝑜𝑤𝑛 𝑠𝑜𝑢𝑟𝑐𝑒 + 𝑠𝑠 𝑢𝑛𝑘𝑛𝑜𝑤𝑛 𝑠𝑜𝑢𝑟𝑐𝑒
= 𝑠𝑠 𝑏/𝑤 𝑠𝑎𝑚𝑝𝑙𝑒 + 𝑠𝑠 𝑤𝑖𝑡ℎ𝑖𝑛 𝑠𝑎𝑚𝑝𝑙𝑒
= 𝑠𝑠 𝑒𝑥𝑝𝑙𝑎𝑖𝑛𝑒𝑑 + 𝑠𝑠 𝑢𝑛𝑒𝑥𝑝𝑙𝑎𝑖𝑛𝑒𝑑

Breaking down variance into 2

Partitioning of degrees of freedom:

SS Total = SSTr + SSE
df:  n − 1  =  (k − 1) + (n − k)

n - number of observations

Insert flowchart picture

4 assumptions of one way ANOVA


1. Data must be random
2. Data should be normally distributed
3. Data must have equal population variances
σ₁² = σ₂² = σ₃² = σ₄² = σ₅² = σ₆²
(rule of thumb: largest sd / smallest sd < 2)

4. Treatments are independent

ANOVA qs

Step 1:
H₀: μ₁ = μ₂ = μ₃ = μ₄ = μ₅ = μ₆

𝐻 : 𝐴𝑡 𝑙𝑒𝑎𝑠𝑡 2 𝑚𝑒𝑎𝑛𝑠 𝑎𝑟𝑒 𝑢𝑛𝑒𝑞𝑢𝑎𝑙 (𝑛𝑜𝑡 𝑎𝑙𝑙 𝑚𝑒𝑎𝑛𝑠 𝑎𝑟𝑒 𝑠𝑎𝑚𝑒)

Step 2:
𝛼 = 0.05
Step 3:

       1       2        3        4       5        6
x̄ᵢ    17.2    17.175   18.175   17.75   18.425   18.025      overall x̄ = 17.79167
sᵢ²   1.367   2.709    3.769    7.217   3.156    2.663

(the second row must be the sample variances sᵢ²: with nᵢ = 4 each, SSE = 3Σsᵢ² = 62.64 checks out)

SSTR = Σ nᵢ(x̄ᵢ − x̄)²

SSE = Σ (nᵢ − 1)sᵢ²

Checking the assumptions: strictly speaking the variances aren't equal, so a transformation would be required before using ANOVA

SV (source of variation)                     SS       df                                   MSS (mean SS)        f                      P-value
Treatment (b/w machines), SSTR               5.3383   dfn = 6 − 1 = 5 (no. of machines − 1)   5.3383/5 = 1.067     f = 1.067/3.48 ≈ 0.31   greater than 0.1 (from R: 0.902)
Error/residual (within machines), SSE        62.64    dfd = 23 − 5 = 18                    62.64/18 = 3.48
Total, SST                                   67.98    24 − 1 = 23 (total observations − 1)
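This table can also be reproduced from the six group summaries (Python sketch, treating the second row of the summary table as group variances sᵢ², as argued above):

```python
# One-way ANOVA for the six-machine example: 4 observations per machine
ns        = [4] * 6
means     = [17.2, 17.175, 18.175, 17.75, 18.425, 18.025]
variances = [1.367, 2.709, 3.769, 7.217, 3.156, 2.663]  # s_i^2

grand = sum(n * m for n, m in zip(ns, means)) / sum(ns)       # 17.79167
sstr  = sum(n * (m - grand)**2 for n, m in zip(ns, means))    # between machines
sse   = sum((n - 1) * v for n, v in zip(ns, variances))       # within machines

dfn, dfd = len(ns) - 1, sum(ns) - len(ns)  # 5 and 18
f = (sstr / dfn) / (sse / dfd)
print(round(sstr, 4), round(sse, 2), round(f, 2))  # 5.3383 62.64 0.31
```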

P-value > 0.05 so accept null hypothesis. Efficiency of all machines is the same
