Tuesday, 16 January 2024 2:58 PM
Mean
Median
Mode
Visualisations
Graph - quantitative data - histogram, frequency curve, scatter plot, stem-and-leaf
Chart - qualitative data - bar, pie
Predictive stats
$H_0$ (null hypothesis): $\mu_F = \mu_M$
$H_1$ (alternative hypothesis): $\mu_F < \mu_M$
Example: women complain that the average salary of men ($\mu_M$) is much higher than that of women ($\mu_F$)
Step 4: Split the data into training data (70% or 80%) and testing data (30% or 20%). This is a major step in ML: the training data file is needed to create the model, the testing data is then used to verify the accuracy of the model, and only after that is the model deployed
Randomly take 70% of the records and store them in a file called training data. The remaining 30% go to a file called testing data
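The random 70/30 split described above can be sketched as follows. This is a minimal illustration, not the course's own procedure: the function name `train_test_split`, the fixed `seed`, and the stand-in list of 100 records are all assumptions made for the example.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into (training, testing)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))  # hypothetical stand-in for 100 data rows
train, test = train_test_split(records)
print(len(train), len(test))
```

Every record lands in exactly one of the two files, so the model is never tested on rows it was trained on.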
Propose a model according to data
Make scatter plots, if linear then use a linear model, if quadratic use that
Supervised/unsupervised learning
COMPLETE
Lecture 2
Thursday, 18 January 2024 4:05 pm
Population vs sample: the population is the whole group (universal set), i.e. the complete set of all entities within a particular boundary or timeframe; a sample is a part of the group (subset)
Example: the population of all account holders of HBL bank from the inception of the bank to today. The space is bounded by the name of the bank. It doesn't require a census because the data is already digitally present in the system
Discrete - countable
Continuous - measurable that needs machinery or tools to check
Strings can be ignored as far as predictive powers are concerned. Categorical variables still have
some sort of minimal predictive power. Ordinal are best in qualitative for predictive analytics.
$\sigma^2$ → population variance
$s^2$ → sample variance
$P$ → population proportion
$p$ → sample proportion
$\rho$ → population correlation
$r$ → sample correlation
$H_0: \mu = 75000$
$H_1: \mu \neq 75000$
Inferential stats: looking at a sample in order to predict or estimate a population parameter, or to test the parameter
All possible samples without replacement = $\binom{N}{n}$
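$\mu$ = population mean = $\frac{\sum X}{N}$
$\sigma^2$ = population variance = MSD (mean squared deviation) = $\frac{\sum (X - \mu)^2}{N}$
$\bar{x} = \frac{\sum x}{n}$
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$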
Why do we use n-1 in sample variance?
$\hat{\mu} = \bar{x}$
$\hat{\sigma}^2 = s^2$
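One way to see why n-1 is used in the sample variance: averaged over all possible samples, dividing by n-1 gives exactly the population variance, while dividing by n gives too small a value. The sketch below checks this by enumeration. The population {1, 3, 5} and the sample size of 2 are illustrative choices for this example, and the helper name `variance` is hypothetical.

```python
from fractions import Fraction
from itertools import product

population = [1, 3, 5]  # illustrative population chosen for this sketch
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance

def variance(sample, divisor):
    """Sum of squared deviations from the sample mean, over the divisor."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample) / divisor

n = 2
samples = list(product(population, repeat=n))  # all samples, with replacement
avg_with_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)
avg_with_n = sum(variance(s, n) for s in samples) / len(samples)

print(sigma2, avg_with_n_minus_1, avg_with_n)
```

The n-1 average matches the population variance, which is what makes $s^2$ an unbiased point estimate of $\sigma^2$.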
Tests of normality:
1. Kolmogorov-Smirnov test
2. Shapiro-Wilk test
Symmetric: skewness = 0
Skewness +ve: right-skewed (longer right tail)
Skewness -ve: left-skewed (longer left tail)
Chapter 1
The elements in a population share at least one factor of homogeneity, but on other factors the population can be bifurcated into segments (so it also has heterogeneous elements). The number of elements in a population can be finite or infinite
Census: complete enumeration, collect data and analyse the whole population
Random sample - everyone has a chance of being selected, not necessarily equal probability for
everyone
Simple random sample - random sample with equal probability for everyone
In stratified sampling, bifurcate the population with respect to the heterogeneous factor, select a sample from each segment using simple random sampling, and then combine them all into a final sample
Equal allocation vs proportional allocation (proportional to size of stratum)
Systematic sampling: for example, take the 1000 pages of a book and divide them into 10 segments of 100 pages according to page number. Take the first page number randomly from the first segment, e.g. 97. After that take page numbers systematically at a fixed interval: 197, 297, …, 997. The first element is random, the rest follow systematically in sequence. Segments are always of equal size
Proportion is mainly for qualitative characteristics but can also be used on quantitative - it is
basically probability
Quantitative techniques can be used on only quantitative discrete or numerical data but are more
accurate
Qualitative techniques can be used on both quantitative and qualitative data but are less accurate
Chapter 7
m = number of possible samples
m = $N^n$ (with replacement) or $\binom{N}{n}$ (without replacement)
The sampling distribution of the mean is a table comprising all values of the sample mean along with their probabilities, such that the probabilities total 1. It is the probability distribution of a statistic.
Sampling error is the difference between the estimated value and the actual value of a parameter, $\bar{x} - \mu$. It is reduced if we increase the sample size. Another method is to use the most appropriate sampling method (simple random sampling for a homogeneous population, etc.)
Sampling error is the error resulting from using a sample to estimate a population characteristic (parameter)
Standard error is the standard deviation of the sampling distribution of the mean. Reduce it by increasing the sample size.
The fpc (finite population correction) is neglected when the population size is very large. So if N is large, or N is small but sampling is done with replacement, the population is treated as a large one and behaves like an infinite population
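Using table:
Mean of the sampling distribution of the mean = $\mu_{\bar{x}} = E(\bar{x}) = \sum \bar{x} P(\bar{x})$
$n_1 = 5$, $n_2 = 3$, $n_3 = 2$
Unweighted mean: $\bar{x} = \frac{x_1 + x_2 + x_3}{3}$
Weighted mean: $\bar{x} = \frac{n_1 x_1 + n_2 x_2 + n_3 x_3}{n_1 + n_2 + n_3}$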
Frequency distribution:
x         f
150-200   5
200-250   9
250-300   3
⋮         ⋮
Total     50
Probability distribution (x = class midpoint, P(x) = f / total):
x       P(x)
175     5/50
225     9/50
275     3/50
⋮       ⋮
Total   50/50 = 1
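$N$ = population size = 3
$n$ = sample size = 3
By: with replacement
$m = N^n = 3^3 = 27$
$\mu = \frac{1 + 3 + 5}{3} = 3$
$\sigma^2 = \frac{(1 - 3)^2 + (3 - 3)^2 + (5 - 3)^2}{3} = \frac{8}{3}$
$\sigma_{\bar{x}}^2 = V(\bar{x}) = \sum \bar{x}^2 P(\bar{x}) - \left[\sum \bar{x} P(\bar{x})\right]^2 = \frac{2403}{243} - 3^2 = \frac{8}{9}$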
F. Compare the population mean with the mean of the sampling distribution of the mean
$\mu = \mu_{\bar{x}}$
$\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
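The worked example above (population 1, 3, 5; samples of size 3 drawn with replacement) can be checked by brute force. The sketch below lists all 27 samples, builds the sampling distribution of the mean, and computes its mean and variance with exact fractions. Variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [1, 3, 5]
n = 3

# Every ordered sample of size n drawn with replacement: N**n = 27 samples.
samples = list(product(population, repeat=n))
means = [Fraction(sum(s), n) for s in samples]

# Sampling distribution: each distinct sample mean with its probability.
dist = {m: Fraction(c, len(samples)) for m, c in Counter(means).items()}

mean_of_means = sum(m * p for m, p in dist.items())
var_of_means = sum(m * m * p for m, p in dist.items()) - mean_of_means ** 2

print(len(samples), mean_of_means, var_of_means)
```

The mean of the sample means equals the population mean 3, and the variance 8/9 equals the population variance 8/3 divided by n = 3.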
Sampling distribution (s.d.): it is the p.d. (probability distribution) of the statistic $\bar{x}$. Two columns: $\bar{x}$ and $P(\bar{x})$
i. $\mu_{\bar{x}} = \mu$
ii. $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
iii. If $X \sim N$ (X follows a normal distribution) then $\bar{x}$ does too
iv. If $X \sim$ any (X follows any distribution) then $\bar{x} \approx N$ (approximately normal) provided $n \geq 25$ → central limit theorem
$z = \frac{x - \mu}{\sigma}$ → regular z-score
For CLT:
$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
Qs 2: Let’s say that 80% of all business startups in the IT industry report that they generate a
profit in their first year. If a sample of 10 new IT business startups is selected, find the probability
that exactly seven will generate a profit in their first year. Find the probability that at least 4 will
generate a profit in their first year. Find the probability that between 3 to 5 startups will be
successful
$P(x = 7)$:
$\binom{10}{7} (0.8)^7 (0.2)^3 = 0.201$
$P(x \geq 4)$:
$P(3 \leq x \leq 5)$:
$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$
Use the cumulative table unless p is not in the table; in that case, work it out manually
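When p is not in the table, the three probabilities in Qs 2 can be worked out directly from the binomial formula. The sketch below is one way to do the manual work. The helper name `binom_pmf` is hypothetical, and reading "between 3 to 5" as inclusive (3, 4, 5) is an assumption.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.8
p_exactly_7 = binom_pmf(7, n, p)
# At least 4 is the complement of 0, 1, 2 or 3 successes.
p_at_least_4 = 1 - sum(binom_pmf(x, n, p) for x in range(0, 4))
# Inclusive reading of "between 3 to 5": x = 3, 4, 5.
p_3_to_5 = sum(binom_pmf(x, n, p) for x in range(3, 6))

print(round(p_exactly_7, 3), round(p_at_least_4, 4), round(p_3_to_5, 4))
```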
Qs 3: An auditor takes a random sample of size 36 from a population of size 1,000 accounts
receivable. The mean value of the accounts receivable for the population is $260.00, with the
population standard deviation $45.00.
(using with replacement) - Related to sampling distribution of mean
(a) What is the probability that the sample mean will be less than $250.00?
(b) What is the probability that the sample mean will be within $15.00 of the population mean?
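We don't know the distribution of X, but n = 36 > 25, so $\bar{x}$ is approximately normal by the CLT
$\mu = 260$, $\sigma = 45$ (dollars)
Because the population distribution is unknown but the sample size is large, we can use the normal distribution to find this probability
$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
(b) $P(\mu - 15 < \bar{x} < \mu + 15) = P\left(\frac{\mu - 15 - \mu}{\sigma/\sqrt{n}} < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < \frac{\mu + 15 - \mu}{\sigma/\sqrt{n}}\right)$
$= P\left(\frac{-15}{45/\sqrt{36}} < z < \frac{15}{45/\sqrt{36}}\right) = P(-2 < z < 2)$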
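Both parts of Qs 3 can be evaluated numerically instead of from the z-table. The sketch below treats the sample mean as approximately normal (CLT) and uses the with-replacement standard error, as the notes do. It relies on `statistics.NormalDist` from the Python standard library, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 260.0, 45.0, 36
se = sigma / sqrt(n)          # standard error, sampling with replacement
x_bar = NormalDist(mu, se)    # CLT approximation for the sample mean

p_a = x_bar.cdf(250)                             # (a) P(x_bar < 250)
p_b = x_bar.cdf(mu + 15) - x_bar.cdf(mu - 15)    # (b) within $15 of mu

print(round(se, 2), round(p_a, 4), round(p_b, 4))
```

Part (b) is the same quantity as P(-2 < z < 2), since 15 is exactly two standard errors of 7.5.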
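$z = \frac{x - \mu}{\sigma}$
$\mathrm{var}(z) = \mathrm{var}\left(\frac{x - \mu}{\sigma}\right)$
$= v\left(\frac{x}{\sigma} - \frac{\mu}{\sigma}\right) = v\left(\frac{x}{\sigma}\right) - v\left(\frac{\mu}{\sigma}\right)$
$= v\left(\frac{x}{\sigma}\right) - 0 = \frac{v(x)}{\sigma^2} - 0$
$= \frac{\sigma^2}{\sigma^2} - 0 = 1$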
Qs 4: The mean selling price of senior condominiums in Green Valley over a year was $215,000.
The population standard deviation was $25,000. A random sample of 100 new unit sales was
obtained.
a. What is the probability that the sample mean selling price was more than $210,000?
b. What is the probability that the sample mean selling price was between $213,000 and $217,000?
c. What is the probability that the sample mean selling price was between $214,000 and $216,000
$\mu = 215000$, $\sigma = 25000$
$n = 100 > 25$
a. $P\left(z > \frac{210000 - 215000}{25000/\sqrt{100}}\right) = P\left(z > \frac{-5000}{2500}\right) = P(z > -2)$
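All three parts of Qs 4 follow the same pattern: convert each price to a z-score with the standard error $\sigma/\sqrt{n}$ = 2500, then take areas under the standard normal curve. A minimal sketch, assuming the CLT approximation and no finite population correction (no population size is given); `prob_between` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 215_000, 25_000, 100
se = sigma / sqrt(n)
z = NormalDist()  # standard normal

def prob_between(low, high):
    """P(low < x_bar < high) under the CLT approximation."""
    return z.cdf((high - mu) / se) - z.cdf((low - mu) / se)

p_a = 1 - z.cdf((210_000 - mu) / se)    # a. more than $210,000
p_b = prob_between(213_000, 217_000)    # b.
p_c = prob_between(214_000, 216_000)    # c.
print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```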
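$\sigma_{\bar{x}} = \sqrt{\frac{N - n}{N - 1}} \cdot \frac{\sigma}{\sqrt{n}}$
$\mu_{\hat{p}} = P$
$\sigma_{\hat{p}} = \sqrt{\frac{P(1 - P)}{n}}$
Z-score:
$z = \frac{x - \mu}{\sigma}$ → $z = \frac{\hat{p} - P}{\sqrt{\frac{P(1 - P)}{n}}}$ → z-score for proportion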
P = 0.6, n = 100
The sample size is large, so $\hat{p}$ is approximately normally distributed and we can use the z-score method to solve this
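$SE = \sigma_{\hat{p}} = \sqrt{\frac{0.25(0.75)}{120}}$
$P\left(\frac{\hat{p} - P}{\sigma_{\hat{p}}} < \frac{a}{\sigma_{\hat{p}}}\right) = 0.9$
$P\left(z < \frac{a}{\sigma_{\hat{p}}}\right) = 0.9$
At $P = 0.9$, $z = 1.29$
$\frac{a}{\sqrt{\frac{0.25(0.75)}{120}}} = 1.29$
$a = 0.051$
qnorm(0.9, 0, 1) = 1.281552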
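$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ → with replacement
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \cdot \sqrt{\frac{N - n}{N - 1}}$ → without replacement
$\sigma_{\hat{p}} = \sqrt{\frac{P(1 - P)}{n}}$ → with replacement
$\sigma_{\hat{p}} = \sqrt{\frac{P(1 - P)}{n}} \cdot \sqrt{\frac{N - n}{N - 1}}$ → without replacement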
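The four standard-error formulas differ only in whether the finite population correction is applied. A small sketch of that idea follows. The function names `se_mean` and `se_proportion` and the optional `N` argument are hypothetical, chosen for this illustration; the usage lines reuse the Qs 3 figures.

```python
from math import sqrt

def se_mean(sigma, n, N=None):
    """Standard error of the sample mean; pass N to apply the fpc."""
    se = sigma / sqrt(n)
    if N is not None:  # sampling without replacement from a finite population
        se *= sqrt((N - n) / (N - 1))
    return se

def se_proportion(P, n, N=None):
    """Standard error of the sample proportion; pass N to apply the fpc."""
    se = sqrt(P * (1 - P) / n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))
    return se

# Qs 3 figures: sigma = 45, n = 36, N = 1000
print(round(se_mean(45, 36), 4), round(se_mean(45, 36, N=1000), 4))
```

With N = 1000 and n = 36 the correction factor is close to 1, which is why the fpc is often neglected for large populations.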
Lecture 7 Double slot extra class
Before mids (statistical inference focused): need to complete the course up to analysis of variance (ch 14) - ANOVA is included in the mid. After mids, predictive analytics
Statistics has two branches here: estimation of parameter, and testing of hypothesis
Estimation: the parameter is completely unknown and we estimate it using sample statistics.
Hypothesis testing: suppose party A says that the proportion of their supporters is 0.6 and party B wants to challenge it. We have some pre-defined knowledge of the parameter that needs to be checked, so this is hypothesis testing
Estimation of parameter
Point estimation
$\hat{\mu} = 500$ hours
Draw a sample and find the mean
The sample mean is a point estimate of the population mean
$\hat{\mu} = \bar{x}$
$\hat{\sigma}^2 = s^2$
$\hat{P} = p$
$\hat{\rho} = r$
$\bar{x} - \mu$ = sampling error
To reduce the error, increase the sample size and ensure random sampling
Example 8.1:
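$\bar{x} = 63.28$ USD → sample mean
$s^2 = 63.08$ USD² → sample variance
$s = 7.94$ USD → sample st. dev
$z = \frac{x - \mu}{\sigma}$ → normal population
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ → z-score for sampling dist of sample mean
Make $\mu$ the subject:
$\bar{x} - \mu = z \cdot \frac{\sigma}{\sqrt{n}}$
$\mu = \bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}$
Confidence interval: $\bar{x} - z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$ to $\bar{x} + z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$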
Assumptions (for known population st. dev):
1. Must be simple random sample
2. The population must be normal or sample size large (CLT)
3. $\sigma$ known → unrealistic assumption in real life
To check normality: Q-Q plot, Kolmogorov-Smirnov (KS) test and Shapiro-Wilk test (we will do these in R). The Q-Q plot should be a straight, upward-sloping line
The graphs show that the population is not normal, but because the sample size is large,
according to the central limit theorem, we can assume that the sampling distribution of sample
mean is approximately normal
$\bar{x} = 36.4$ years
$1 - \alpha = 0.95$
$\alpha = 0.05$
$\frac{\alpha}{2} = 0.025$ (probability for which we find z)
Interpretation: we are 95% confident that the true mean of the population lies within the confidence interval
Equivalently, if we take 100 different samples and build an interval from each, the true mean will lie within about 95 of those confidence intervals. About 5 intervals will be such that the true mean lies outside the confidence interval.
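A different type of question would be to determine the sample size
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
$z \cdot \frac{\sigma}{\sqrt{n}} = \bar{x} - \mu = E$ (margin of error or sampling error)
$n = \left(\frac{z \cdot \sigma}{\bar{x} - \mu}\right)^2 = \left(\frac{z \cdot \sigma}{E}\right)^2$ → rounded up to nearest whole number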
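The sample-size formula can be wrapped in a small function. This is a sketch under the same assumptions as the z-interval ($\sigma$ known). The function name `sample_size_for_mean` and the inputs in the usage line (σ = 12, margin = 0.5) are hypothetical and do not come from the notes.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= margin (sigma assumed known)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    return ceil((z * sigma / margin) ** 2)   # always round up

# Hypothetical inputs, not from the notes: sigma = 12, margin of error = 0.5
print(sample_size_for_mean(sigma=12, margin=0.5))
```

Rounding up rather than to the nearest integer guarantees the margin of error is not exceeded.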
Assumptions:
1. Must be simple random sample
2. The population must be normal or sample size large (CLT)
3. 𝜎 unknown → realistic assumption in real life
Confidence interval: $\bar{x} - t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$ to $\bar{x} + t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$
The graphs show that the population is not normal, but because the sample size is large,
according to the central limit theorem, we can assume that the sampling distribution of sample
mean is approximately normal
Because the population st.dev is unknown, we are using t-test
𝑥̅ = 513.32 𝑈𝑆𝐷
𝑠 = 262.23 𝑈𝑆𝐷
$\frac{\alpha}{2} = 0.025$
$df = \nu = n - 1 = 24$
$t_{\alpha/2;\,\nu} = t_{0.025;\,24} = 2.064$
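With the table value in hand, the t-interval itself is a one-line calculation. The sketch below plugs in the figures from the notes; n = 25 is inferred from df = 24, and the critical value is typed in from the t-table rather than computed, since the Python standard library has no t-distribution.

```python
from math import sqrt

x_bar, s = 513.32, 262.23   # sample mean and st. dev from the notes (USD)
n = 25                      # follows from df = n - 1 = 24
t_crit = 2.064              # t_{0.025; 24}, read from the t-table

margin = t_crit * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(round(lower, 2), round(upper, 2))
```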
In the t-table, probabilities are outside (on top), t-scores are inside. In the z-table, however,
probabilities are inside and z-scores are outside
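R functions for the t-distribution:
dt - density
pt - cumulative probability
qt - quantile (inverse of pt)
rt - random values
qt(0.025, 24) gives the lower-tail value, about -2.064; qt(0.975, 24) gives +2.064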
Chapter 9: hypothesis testing
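The claim goes in the alternate hypothesis ($H_1$). The null hypothesis ($H_0$) is unbiased, neutral etc
Start by assuming the null hypothesis is true, then check whether the data support the alternate
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ ($\sigma$ known)
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$ ($\sigma$ unknown, $df = n - 1$)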
Example 9.5
$H_0: \mu = 275$ yards
$H_1: \mu < 275$ yards
$\alpha = 0.05$
$n = 25$
$\sigma = 20$ yards → z-test
$z = \frac{264.4 - 275}{20/\sqrt{25}} = -2.65$
$z_{0.05} = -1.64$
If z < -1.64, reject the null hypothesis; otherwise do not reject it
Because the z-statistic, -2.65, is less than the critical value -1.64, we reject the null hypothesis.
Conclusion: data provides sufficient evidence to conclude that mean driving distance of Jack is
less than 275 yards
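The critical value approach in Example 9.5 can be sketched as a small function for the left-tailed case. The function name `z_test_left_tailed` is hypothetical. The critical value is computed rather than read from the table, so it comes out as -1.645 instead of the rounded -1.64.

```python
from math import sqrt
from statistics import NormalDist

def z_test_left_tailed(x_bar, mu0, sigma, n, alpha=0.05):
    """Return (z statistic, critical value, reject H0?) for H1: mu < mu0."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(alpha)  # lower-tail critical value
    return z, z_crit, z < z_crit

z, z_crit, reject = z_test_left_tailed(x_bar=264.4, mu0=275, sigma=20, n=25)
print(round(z, 2), round(z_crit, 3), reject)
```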
Lecture 9
Tuesday, 13 February 2024 4:00 pm
Qs 1:
$H_0: \mu = 70$ years
$H_1: \mu > 70$ years
$\bar{x} = 71.8$ years
$n = 100$
$\sigma = 8.9$ years
$\alpha = 0.05$
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{71.8 - 70}{8.9/\sqrt{100}} = 2.0225$
$z_{0.05} = 1.645$ (the z-critical value)
Conclusion: The data provides sufficient evidence to reject the null hypothesis and conclude that the mean life span today is greater than 70 years
Qs 7:
$H_0: \mu = 35$ mins
$H_1: \mu < 35$ mins
$\bar{x} = 33.1$ mins
$n = 20$
$s = 4.3$ mins
$df = 19$
Assumption: if the population sd is unknown, then the sample sd can be used as a point estimate
$\alpha = 0.05$
We use the t-test:
$t = \frac{33.1 - 35}{4.3/\sqrt{20}} = -1.976$
$t_{0.05;\,19} = -1.729$
Conclusion: the data provides sufficient evidence to reject the null hypothesis and conclude that the average time taken to complete the standardized test is less than 35 minutes
The strategy of hypothesis testing we've been doing so far is the critical value approach. The alternative is the p-value approach:
Qs 1:
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{71.8 - 70}{8.9/\sqrt{100}} = 2.0225$
Check in z-table
$P = 1 - 0.9783 = 0.0217$
Qs 7:
$t = \frac{33.1 - 35}{4.3/\sqrt{20}} = -1.976$
For a two-tailed test, take the tail probability from one side and multiply by 2 to get the P-value
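The p-value for the z-test in Qs 1 can be computed directly, including the doubling rule for a two-tailed test. A minimal sketch: the function name `z_p_value` and its `tail` argument are hypothetical. The t-test in Qs 7 is not covered, because the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import NormalDist

def z_p_value(x_bar, mu0, sigma, n, tail):
    """P-value of a one-sample z-test; tail is 'left', 'right' or 'two'."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    cdf = NormalDist().cdf
    if tail == "left":
        return cdf(z)
    if tail == "right":
        return 1 - cdf(z)
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))  # one tail, doubled
    raise ValueError("tail must be 'left', 'right' or 'two'")

p_right = z_p_value(71.8, 70, 8.9, 100, tail="right")  # Qs 1
p_two = z_p_value(71.8, 70, 8.9, 100, tail="two")
print(round(p_right, 4), round(p_two, 4))
```

The right-tailed value differs slightly from the table result 0.0217 because the table rounds z to 2.02.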
Lecture 10
Thursday, 15 February 2024 4:00 pm
Chapter 10:
Same 6 steps of hypothesis tests
Formula different
𝛼 = 0.05
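Hyderabad: $\bar{x}_1 = 750$, $\sigma_1 = 20$
Karachi: $\bar{x}_2 = 780$, $\sigma_2 = 25$
$z = \frac{x - \mu}{\sigma}$
In CLT z-score:
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ → CLT for $\bar{x}$ with replacement
$z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}} \cdot \sqrt{\frac{N - n}{N - 1}}}$ → CLT for $\bar{x}$ without replacement
$z = \frac{\hat{p} - P}{\sqrt{\frac{P(1 - P)}{n}}}$ → CLT for $\hat{p}$ with replacement
$z = \frac{\hat{p} - P}{\sqrt{\frac{P(1 - P)}{n}} \cdot \sqrt{\frac{N - n}{N - 1}}}$ → CLT for $\hat{p}$ without replacement
Two samples:
$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
$\frac{\sigma}{\sqrt{n}} \rightarrow \frac{\sigma^2}{n}$ (square the standard error to get a variance; the variances of the two sample means add)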
For the question:
P-value ≈ 0
From R: 0.00000014487
Because the p-value is approximately 0, we reject the null hypothesis and conclude that it is not true
$P \leq \alpha$, so reject $H_0$
Conclusion: the data provides sufficient evidence to not accept the null hypothesis and conclude
that the performance of employees is significantly better in Karachi as compared to Hyderabad
(due to higher sample mean)
Qs from slide 8
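$\mu_1 = 40$, $\sigma_1 = 12$
$\mu_2 = 40$, $\sigma_2 = 6$
$X_1 \sim N$, $X_2 \sim N$
a. $n_1 = 9$, $n_2 = 4$
$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2 = 0$
$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{12^2}{9} + \frac{6^2}{4}} = 5$
$z = \frac{-10 - 0}{\sqrt{\frac{12^2}{9} + \frac{6^2}{4}}} = -2$
$z = \frac{10 - 0}{\sqrt{\frac{12^2}{9} + \frac{6^2}{4}}} = 2$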
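The numbers in part (a) can be reproduced with a few lines. The sketch below computes the standard error of the difference of two sample means and the two z-scores. Reading the question as "probability that the sample means differ by less than 10" is an assumption, since the slide text itself is not in the notes.

```python
from math import sqrt
from statistics import NormalDist

mu1, sigma1, n1 = 40, 12, 9
mu2, sigma2, n2 = 40, 6, 4

mean_diff = mu1 - mu2
# Variances of the two sample means add because the samples are independent.
se_diff = sqrt(sigma1**2 / n1 + sigma2**2 / n2)

z_low = (-10 - mean_diff) / se_diff
z_high = (10 - mean_diff) / se_diff
# Assumed reading of the question: P(-10 < x_bar_1 - x_bar_2 < 10)
p_within = NormalDist().cdf(z_high) - NormalDist().cdf(z_low)
print(se_diff, z_low, z_high, round(p_within, 4))
```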
$H_0: \mu = 200$ ml
$H_1: \mu \neq 200$ ml
Boys are claiming that marks obtained by boys are significantly lower than those obtained by girls
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \neq \mu_2$
A slimming center claims that it reduces average weight significantly in a month. Each person will have two records, one in $x_1$ and one in $x_2$. Therefore these are dependent samples
You join an exam prep center for the GRE. Students take one mock test at the time of entrance and their marks are recorded. One month later the same students take a test and the marks are recorded again. Also a dependent sample.
If two separate entities, independent sample. If one entity is recorded twice at different times, dependent sample.
Dependent sample also called: paired observations, associated sample, repeated measure,
related sample
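If $\frac{s_1}{s_2} \geq 2$ (larger over smaller) → non-pooled
For T4:
$\sigma_1 = \sigma_2 = s_p$
$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
For T6: we need to make a column of the differences of $x_1$ and $x_2$. Find the mean and sd of that column
$s_p$ → the two groups have separate variances, so we take a weighted mean to get one
$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
$s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$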
$s_p$ in this question $= \sqrt{\frac{34 (26.1)^2 + 29 (23.95)^2}{35 + 30 - 2}} = 25.2$
$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = 2.39$
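$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 < \mu_2$
$\alpha = 0.05$
$s_1 = 84.7$, $s_2 = 38.2$
$\frac{s_1}{s_2} = \frac{84.7}{38.2} = 2.22 > 2$ → non-pooled
$df$ for non-pooled (T5) $= \Delta = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}} = 17$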
The pooled vs non-pooled choice can also be checked using a box plot. If one box is more than double the other, use non-pooled. This is a heuristic
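The pooled/non-pooled rule and the non-pooled degrees of freedom can be sketched together. The standard deviations below come from the notes, but the sample sizes are not recorded there: n1 = 14 and n2 = 6 are assumed values used only to illustrate the formula. The function names are hypothetical too.

```python
def use_pooled(s1, s2):
    """Heuristic from the notes: pool unless larger sd / smaller sd >= 2."""
    return max(s1, s2) / min(s1, s2) < 2

def welch_df(s1, n1, s2, n2):
    """Degrees of freedom for the non-pooled t-test, rounded down."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    delta = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return int(delta)

s1, s2 = 84.7, 38.2      # sample st. devs from the notes
n1, n2 = 14, 6           # ASSUMED sample sizes, not given in the notes
print(use_pooled(s1, s2), welch_df(s1, n1, s2, n2))
```

Rounding the degrees of freedom down is the conservative choice when looking up the t-table.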
T5
$\frac{0.01 + 0.005}{2} = 0.0075$
Conclusion: the data provides sufficient evidence to conclude that the average operative
time in dynamic system is significantly smaller than the average operative time in the static
system and the claim is valid
dt - density
pt - cumulative probability
qt - inverse function of pt
rt - random values
Lecture 12: Variance and proportion testing
Thursday, 22 February 2024 4:04 pm
anova
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 < \mu_2$
$s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$
$s_p = \sqrt{\frac{69 (1.3^2) + 104 (1.4^2)}{70 + 106 - 2}} = 1.357 \approx 1.36$
$df = n_1 + n_2 - 2 = 174$
P-value method:
Using R, pt(-4.29, 174), the p-value is 0.00001478
5 𝑇 dependent sample
Car same, driver same, only difference of additive
Using R:
𝑡 = 1.59
𝑑𝑓 = 10 − 1 = 9
GUI:
Top left corner - script file
Bottom left - console
Top right - environment, history of commands, other tabs also
Bottom right - plots section (all graphs printed here), other tabs also
Two parts:
Core R (core software) and libraries
Some functions are available in core R and the rest are in libraries (which is extended R)
Core R functions are directly available, the libraries have to be imported
ggplot2 is the library for graphs/visualisations
Homoscedasticity?
Quiz Prep
Monday, 26 February 2024 7:00 pm
a. Type I error is when we reject $H_0$ but it's actually true. So we need to find the probability of the rejection region
$1 - [P(x = 6) + P(x = 7) + P(x = 8) + P(x = 9) + P(x = 10) + P(x = 11) + P(x = 12)]$
b. Type II error is when we accept $H_0$ but it's actually false. So we need the probability of the acceptance region
$p = 0.5$
$p = 0.7$
a. Type I error is when we reject $H_0$ but it's actually true. Need to find the probability of the rejection region
$p = 0.6$
$\frac{220}{400} = 0.55$, $\frac{260}{400} = 0.65$
b. Type II error is when we accept $H_0$ but it's actually false. Use the new $\mu$ to find the probability of the acceptance region
Why does the name have variance but the procedure compares means? Because it uses the f-test, the formula for which is:
Bifurcate/decompose the variance into 2 components: one is variance from a known source and the other is variance from an unknown source
$f = \frac{s_{known}^2}{s_{unknown}^2} = \frac{s_{between}^2}{s_{within}^2}$
A B C
1
2
⋮
n
Variance between the columns is from a known source (each teacher has a different style etc., so differences among marks are to be expected)
Unknown source is variance within a column (unknown variation). Even when the same teacher is teaching, marks differ within the class and we don't exactly know why
If we take A, B, C as three varieties of wheat, 10 seeds of each, same conditions etc.
Between the varieties, differences are expected because of quality differences, so that is a known source
Within, e.g., variety A, why isn't the output of all 10 plants the same? That is unknown variation
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
$H_1$: at least two are unequal (not all are equal)
$\alpha = 0.05$
f-test, $f = \frac{s_{between}^2}{s_{within}^2}$
ANOVA table:
We have 2 df in f-stats
dfn - degree of freedom numerator
dfd - degree of freedom denominator
Data provides sufficient evidence to conclude that a significant difference exists in last year's
mean energy consumption by households among the four U.S. regions
Now to find where difference exists. Which region consumes more electricity and which
consumes less
Mid term up to one-way ANOVA (two proportion/one proportion not included)
MCQs, fill in the blanks, numerical
One-way ANOVA: (alternate def) decomposing total variation into two components, one is called
known sources and other is called unknown sources
n - number of observations
ANOVA qs
Step 1:
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6$
Step 2:
$\alpha = 0.05$
Step 3:
Machine       1       2        3        4       5        6
$\bar{x}_j$   17.2    17.175   18.175   17.75   18.425   18.025
$s_j^2$       1.367   2.709    3.769    7.217   3.156    2.663
Grand mean $\bar{\bar{x}} = 17.79167$
$SSTR = \sum n_j (\bar{x}_j - \bar{\bar{x}})^2$
$SSE = \sum (n_j - 1) s_j^2$
Checking assumptions: the variances aren't equal, so a transformation is required before using ANOVA
P-value > 0.05, so do not reject the null hypothesis. The efficiency of all machines is the same
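The SSTR and SSE formulas above are enough to compute the F statistic from summary statistics alone. The sketch below does that for the six machines, under two loudly stated assumptions: the "s" row is taken to hold variances ($s^2$), and every group is taken to have 4 observations, since the notes do not record the group sizes. The function name is hypothetical, and no p-value is computed because the Python standard library has no F distribution.

```python
def one_way_anova_f(means, variances, sizes):
    """F statistic of one-way ANOVA from per-group summary statistics."""
    k, n_total = len(means), sum(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    sstr = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    sse = sum((n - 1) * v for n, v in zip(sizes, variances))
    mstr = sstr / (k - 1)        # dfn = k - 1
    mse = sse / (n_total - k)    # dfd = n - k
    return grand_mean, mstr / mse

means = [17.2, 17.175, 18.175, 17.75, 18.425, 18.025]
variances = [1.367, 2.709, 3.769, 7.217, 3.156, 2.663]  # ASSUMED to be s^2
sizes = [4] * 6                                          # ASSUMED group size
grand_mean, f_stat = one_way_anova_f(means, variances, sizes)
print(round(grand_mean, 5), round(f_stat, 3))
```

Under these assumptions F comes out well below 1, which is in line with the large p-value and the decision not to reject the null hypothesis.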