Fe Module 5

LESSON 1

Required Reading
Hey there! 👋 Welcome to your guide on the ARCH model. It might sound complicated, but don't worry! We're going to break it
down with simple examples and analogies to make it super clear. We'll explore why stock prices can be so wild and
unpredictable, and how we can model their crazy swings. Let's get started! 🚀

ARCH MODEL
Get ready to dive into the world of financial time series! We're going to learn about a cool tool called the ARCH model that
helps us understand and predict the ups and downs (volatility) of things like stock prices.

1. Common Features of Financial Asset Returns


When we look at financial data, like the price of a stock, we notice some interesting patterns. Understanding these patterns is
the first step to building powerful models.

1.1 Asset Return Time Series Data is More Stable Than Asset Price Time Series Data
Imagine you have a super fast-growing plant. 🌱 If you just plot its height every day, you'll see a line that mostly goes up, up,
up! This is like a stock's price.

Now, what if you instead plotted how much it grew each day as a percentage? Some days it might grow 2%, some days 1%,
and some days maybe even shrink a tiny bit. This percentage change is the return. You'd notice this plot bounces around an
average level (like 1.5% growth per day), making it much more stable and predictable than the ever-increasing height.

This is why in finance, we often use returns instead of prices:

a. Easy Comparisons: A return is a percentage. This means you can easily compare the performance of a $10 stock with a
$1000 stock. A 5% return is a 5% return, no matter the price! It puts everything on the same scale.
b. Stability: As we saw with the plant analogy, asset prices often have a trend (they go up or down over time), while
returns tend to hover around a stable average (often close to zero for daily returns). This stability makes them much easier
to model.

Let's look at Google's stock. The coursework shows two charts:

Google's Stock Price (2016-2021): This chart looks like a mountain climbing steadily upwards. It has a clear trend.
Google's Stock Return (2016-2021): This chart looks like a fuzzy line bouncing wildly around the zero line. It doesn't
have a trend; it's stable on average, which is great for our models!
c. A Cool Math Trick: For small changes, the return can be calculated easily using logarithms. The formula looks like this:

r_t = (p_t − p_{t−1}) / p_{t−1} ≈ log(p_t) − log(p_{t−1})

r_t is the return at time t.
p_t is the price at time t (today).
p_{t−1} is the price at time t − 1 (yesterday).
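To make this concrete, here is a minimal Python sketch (not from the coursework; the price series is made up) that computes both the simple and the log version of the return. For small daily changes the two come out nearly identical.

```python
# A minimal sketch: computing simple and log returns from hypothetical daily closes.
import numpy as np

prices = np.array([100.0, 102.0, 101.5, 103.0, 107.0])  # hypothetical prices p_t

simple_returns = np.diff(prices) / prices[:-1]  # (p_t - p_{t-1}) / p_{t-1}
log_returns = np.diff(np.log(prices))           # log(p_t) - log(p_{t-1})

print(simple_returns)  # e.g. first value: (102 - 100) / 100 = 0.02
print(log_returns)     # close to the simple returns when the changes are small
```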

🧠 Quiz Question: Why is it better to use a stock's daily return for analysis instead of its daily price?
Answer: Because returns are more stable and hover around an average, unlike prices which tend to trend up or down. This
makes returns easier to compare and model.

Explanation: Think of it like comparing your test scores. Saying you improved by 10 points is useful, but saying you improved
by 10% (a return) is even better because it gives context, whether you went from 50 to 60 or 80 to 90. Plus, returns don't just
go up forever; they bounce around an average, making them predictable in a different way.
1.2 The Volatility of Asset Returns Can Vary During Different Time Periods
Volatility is just a fancy word for how much something swings up and down. High volatility means big swings (risky!), and low
volatility means small swings (safer).

If you look at the chart of Google's stock returns, you'll see something fascinating. It's not messy in the same way all the time.
There are calm periods with tiny wiggles, and then suddenly there are stormy periods with huge, wild swings.

This is called volatility clustering. 🌪️ It's like weather: you get a few calm, sunny days in a row, and then a week of
thunderstorms. The volatility isn't random; high-volatility days tend to be followed by more high-volatility days, and calm days
are followed by more calm days. This tells us that the past volatility has an impact on today's volatility.

🧠 Quiz Question: What is volatility clustering?


Answer: It's the tendency for periods of high volatility (big price swings) and low volatility (small price swings) to occur in
groups or clusters.

Explanation: It's like a bag of popcorn. You'll hear a few pops, then a whole bunch of them popping like crazy, and then it
calms down again. The "crazy popping" periods are clustered together.

1.3 Asset Return Distribution Has Heavier Tails Than Normal Distribution
Imagine we plot every daily return of Google stock on a histogram. Most days, the return is very close to zero. But you'll also
see a surprising number of days where the stock had a huge gain or a massive loss.

This gives the distribution "heavy tails" or "fat tails." It means extreme events (big jumps or drops) happen more often than
you'd expect in a perfect "bell curve" (a Normal Distribution).

The coursework shows a histogram and a "Normal QQ Plot" for Google's returns.

Histogram: It looks mostly like a bell curve, but the bars at the far left (big losses) and far right (big gains) are taller than a
perfect bell curve would predict.
Normal QQ Plot: This plot compares Google's returns to a perfect normal distribution. If the returns were perfectly normal,
the dots would all fall on a straight red line. But we see the dots curve away from the line at the ends, which is the classic
sign of fat tails!
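If you want to check this yourself, here is a hedged sketch of one way to do it in Python, assuming you have a NumPy array of daily returns (here a fat-tailed series is simulated as a stand-in for real data):

```python
# A hedged sketch of checking for heavy tails: excess kurtosis plus a Normal QQ plot.
import matplotlib.pyplot as plt
from scipy import stats

# Stand-in for real daily returns: a Student-t sample, which is fat-tailed by construction.
returns = stats.t.rvs(df=3, size=1000, random_state=0) * 0.01

# Excess kurtosis is 0 for a normal distribution; a clearly positive value signals heavy tails.
print("excess kurtosis:", stats.kurtosis(returns))

# Normal QQ plot: dots bending away from the straight line at the ends signal fat tails.
stats.probplot(returns, dist="norm", plot=plt)
plt.show()
```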

🧠 Quiz Question: What does it mean for a stock's return distribution to have "heavy tails"?
Answer: It means that extreme outcomes (very large gains or very large losses) are more common than a normal bell curve
would suggest.

Explanation: Think about grades in a class. A normal distribution would have most people getting a B or C. A heavy-tailed
distribution would have a surprising number of students getting an A+ or an F.

2. Conditional Means and Conditional Variances


These sound technical, but the idea is simple. "Conditional" just means "depending on" or "given some information."

Conditional Mean: This is the expected average of something, given what we already know.
Analogy: What's the average height of a person? That's a regular mean. What's the average height of a person given
that they are a professional basketball player? That's a conditional mean, and it will be much higher!
Conditional Variance: This is the expected spread or volatility of something, given what we already know.
Analogy: What's the typical range of temperatures in a year? That's regular variance. What's the range of
temperatures given that it's July? That's a conditional variance, and it will be much smaller and hotter than the
yearly average.
The formulas are just a mathematical way of saying this:

Conditional Mean of Y given X:

μ_{Y|X} = E(Y | X)

Conditional Variance of Y given X:

σ²_{Y|X} = E(Y² | X) − μ²_{Y|X}

In our stock return models, we use past information (like yesterday's return, X_{t−1}) to predict today's return (X_t).

The ARMA models you learned about predict the conditional mean. They say:

X_t = (a function of past values) + random noise

But we saw that volatility changes! So, we need to model the conditional variance too. We can update our model to look like
this:

X_t = (a function of past values) + (today's predicted volatility) × (random shock)

The key idea is that today's volatility isn't constant; it's a function that depends on past values! This is exactly what ARCH and
GARCH models do. For the rest of this lesson, we'll assume the conditional mean of asset returns is zero (they just bounce
around 0) and focus only on the exciting part: the changing variance.

🧠 Quiz Question: In simple terms, what's the difference between a conditional mean and a conditional variance?
Answer: A conditional mean is a predicted average based on some information (e.g., expected temperature in summer). A
conditional variance is the predicted spread or risk based on some information (e.g., expected temperature range in summer).

3. ARCH(1) Model
Let's meet our first volatility model! ARCH stands for AutoRegressive Conditional Heteroskedasticity. Whoa, what a mouthful!
Let's break it down:

AutoRegressive (AR): Means it uses past values of itself to predict the future. Just like an AR model for returns.
Conditional (C): The prediction depends on what happened recently.
Heteroskedasticity (H): A fancy word for changing or non-constant variance. (Homo-skedasticity means constant
variance).

So, an ARCH model is one where the changing variance depends on past values.

The simplest version is the ARCH(1) model, which only looks back one day. It has two simple equations:

Equation 1: The Return Equation


r_t = σ_t e_t

r_t: Today's return.
σ_t: Today's volatility (standard deviation).
e_t: A random shock, like a dice roll. It's usually a standard normal variable (mean 0, variance 1).

This equation says that today's return is just a random shock scaled up or down by today's volatility. If volatility (σ_t) is high, the return can be huge (positive or negative). If it's low, the return will be small.

Equation 2: The Variance Equation


σ²_t = α₀ + α₁ r²_{t−1}

σ²_t: Today's variance (volatility squared).
α₀: A constant, baseline level of variance. The "normal" amount of chaos.
α₁: A number between 0 and 1 that says how much yesterday's shock affects today's variance.
r²_{t−1}: Yesterday's return, squared. We square it because we care about the size of the move (e.g., -2% and +2% have the same magnitude), not the direction.

This is the magic! It says today's volatility is determined by a baseline level (α₀) plus a fraction (α₁) of how big yesterday's move was (r²_{t−1}). If yesterday had a huge return (a big shock), today's variance will be higher! This is how the model creates volatility clustering.

Diagram: Yesterday's return r_{t−1} → square it to get r²_{t−1} → today, calculate today's variance σ²_t = α₀ + α₁ r²_{t−1}, draw a random shock e_t, and calculate today's return r_t = σ_t e_t.

Properties of the ARCH(1) Model


Let's quickly go through some of its cool features.

3.1 ARCH(1) is Strictly Stationary

This just means the process doesn't explode to infinity. As long as α₁ is less than 1, the model will always stay under control.

3.2 r_t is Conditionally Normally Distributed

If we know what yesterday's return (r_{t−1}) was, then we know today's variance (σ²_t). This means today's return will follow a normal distribution (a bell curve) with a mean of 0 and that specific variance.

r_t | r_{t−1} ∼ N(0, α₀ + α₁ r²_{t−1})

3.3 r_t is White Noise

This is a bit tricky but super important. If you look at the returns unconditionally (without knowing the past), they look totally random!

The average (unconditional mean) is 0.
The returns are uncorrelated with each other.
The variance is constant (unconditionally homoscedastic) and equals:

Var(r_t) = α₀ / (1 − α₁)

So, the series is conditionally heteroskedastic (variance changes based on yesterday) but unconditionally
homoscedastic (the long-run average variance is constant).

3.4 r_t Is Not i.i.d.

Even though the returns are uncorrelated, they are not independent. The volatility of today's return clearly depends on the
size of yesterday's return. This is a great example that "uncorrelated" does not automatically mean "independent"!

3.5 r²_t Is a Non-Gaussian AR(1) Process

This is the big clue for detectives! 🕵️ While the returns (r_t) look like random noise, the squared returns (r²_t) behave just like an AR(1) process.

r²_t = α₀ + α₁ r²_{t−1} + ν_t

This means we can use our old tools (ACF and PACF plots) on the squared returns to see if an ARCH(1) model is a good fit. If
the PACF of the squared returns cuts off after lag 1, we've found our suspect!

3.6 r_t Has Heavier Tails Than the Standard Normal Distribution

The ARCH model naturally produces more extreme outcomes (fat tails), just like we see in real financial data. The periods of
high volatility allow for huge price swings to happen.

3.7 ARCH(m) Process

Why only look back one day? An ARCH(m) model looks back m days. The variance equation just gets a bit longer:
σ²_t = α₀ + α₁ r²_{t−1} + α₂ r²_{t−2} + ⋯ + α_m r²_{t−m}

This lets today's volatility be influenced by the events of the last few days, not just yesterday.

🧠 Quiz Question: In an ARCH(1) model, if a stock has a huge price drop yesterday, what is likely to happen to its volatility
today?

Answer: Its volatility is likely to increase.

Explanation: The ARCH(1) variance equation is σ²_t = α₀ + α₁ r²_{t−1}. A huge price drop means r_{t−1} is a large negative number, but r²_{t−1} will be a large positive number. This makes today's variance, σ²_t, bigger, leading to higher volatility.

4. ARCH(1) Simulation
Let's see an ARCH(1) process in action! The coursework created a fake time series using the ARCH(1) rules with α₀ = 5 and α₁ = 0.5.
The Simulation Plot: The chart shows a series that mostly wiggles around zero. But then, there are clear "clusters" where
it goes wild with big spikes up and down (like around time=600). This looks just like the real Google returns data! It
successfully created volatility clustering.

Now, let's play detective with the ACF and PACF plots.

ACF and PACF of the Returns: The plots show no significant bars sticking out of the blue shaded area. This means the
returns are uncorrelated and look like random white noise. If we stopped here, we'd miss the whole story!
ACF and PACF of the SQUARED Returns: This is where we find the evidence!
The ACF plot shows bars that slowly decay, like a ramp going down.
The PACF plot shows one big, significant bar at lag 1, and then all the other bars are not significant (they are inside
the blue area).

This "decaying ACF and PACF cut-off at lag 1" pattern is the classic fingerprint of an AR(1) process. Since we found it in the
squared returns, we can be confident that an ARCH(1) model is a great way to describe this data.
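Here is a minimal sketch of how such a simulation and the detective work might look in Python, using the coursework's values α₀ = 5 and α₁ = 0.5 (the sample size, seed, and plotting layout are assumptions, not part of the coursework):

```python
# A minimal sketch: simulate an ARCH(1) series and inspect the ACF/PACF of the
# returns and of the squared returns.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(42)
alpha0, alpha1, n = 5.0, 0.5, 1000

r = np.zeros(n)
for t in range(1, n):
    sigma2_t = alpha0 + alpha1 * r[t - 1] ** 2          # variance equation
    r[t] = np.sqrt(sigma2_t) * rng.standard_normal()    # return equation r_t = sigma_t * e_t

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
plot_acf(r, ax=axes[0, 0], title="ACF of returns")                  # looks like white noise
plot_pacf(r, ax=axes[0, 1], title="PACF of returns")
plot_acf(r ** 2, ax=axes[1, 0], title="ACF of squared returns")     # slow decay
plot_pacf(r ** 2, ax=axes[1, 1], title="PACF of squared returns")   # cut-off at lag 1
plt.tight_layout()
plt.show()
```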

5. Conclusion
Awesome job! 🎉 You've learned about the key features of financial returns, like volatility clustering and heavy tails. You've
also been introduced to your first volatility model, ARCH(1), which can capture these features beautifully. The main trick is to
remember that while returns might look random, their squared values can tell a hidden story.

In the next lesson, we'll build on this by learning about the GARCH model, which is like ARCH's even more popular and
powerful cousin. See you there! 🌟

Main Lesson 1
Hey there! 👋 We are going to walk through this Volatility Modeling coursework and cover every detail, formula, and concept presented, explaining it all in a clear and engaging way.

Think of volatility as a measure of uncertainty or risk in finance. It’s the "wobble" in a stock's price. Let's learn how the pros
measure, predict, and model this wobble. 🎢

1. Defining and Measuring Volatility


At its heart, volatility is the annualized standard deviation of the change in price of a financial security. It’s a standardized
way to talk about how much an asset's price swings around.

There are several ways to tackle this:

Historical/sample volatility: Looking at past data.


Geometric Brownian Motion Model: A classic model assuming constant volatility.
Poisson Jump Diffusion Model: An upgrade to GBM that includes sudden jumps.
ARCH/GARCH Models: Models where volatility itself changes over time.
Stochastic Volatility (SV) Models: Another type of model with changing volatility.
Implied volatility: Figuring out volatility from the price of options.

Computing Volatility from Historical Prices


This is the most fundamental approach: using past data to measure past volatility.

Step 1: Get Prices and Calculate Returns 📈


You start with a series of prices over T + 1 time points: {P_t, t = 0, 1, 2, ..., T}. Then, you convert these prices into T periods of returns. Log returns are very common in finance.

R_t = log(P_t / P_{t−1})

R_t is the return for period t.
P_t is the price at time t (e.g., today's closing price).
P_{t−1} is the price at the previous time point (e.g., yesterday's closing price).

Step 2: Calculate the Standard Deviation (σ) 📊

The standard deviation measures how much the returns are spread out from their average. This is the raw, periodic volatility. The theoretical formula is:

σ = √var(R_t) = √E[(R_t − E[R_t])²]

When we use our actual data, we calculate the sample estimate like this:

σ̂ = √( (1/(T − 1)) Σ_{t=1}^{T} (R_t − R̄)² )

σ̂ is our calculated sample standard deviation.
T is the number of returns we have.
R̄ is the simple average of all our returns.

Step 3: Annualize It! 🗓️

To compare the volatility of different assets, we convert the periodic volatility (like daily) into a yearly figure. You do this by multiplying by the square root of the number of periods in a year.

vol̂ = √252 · σ̂ (for daily prices, using 252 business days/year)
vol̂ = √52 · σ̂ (for weekly prices)
vol̂ = √12 · σ̂ (for monthly prices)

Prediction Methods Based on Historical Volatility

Now, let's use past volatility to forecast future volatility.

Historical Average: Predicts that future variance will be the average of all variances up to the current point. This uses all data but can be slow to react to new market conditions.

σ̃²_{t+1} = (1/t) Σ_{j=1}^{t} σ̂²_j

Simple Moving Average: Uses the average of only the last 'm' periods. It's more responsive because it drops old data.

σ̃²_{t+1} = (1/m) Σ_{j=0}^{m−1} σ̂²_{t−j}

Exponential Moving Average: A smart blend that uses all available data but gives more weight to recent events. The parameter β (between 0 and 1) controls how quickly the influence of old data fades.

σ̃²_{t+1} = (1 − β) σ̂²_t + β σ̃²_t

Simple Regression: Uses a regression model to predict tomorrow's variance based on the last 'p' periods of variance. This can be fit on all data or on a "rolling window" of the most recent data.

σ̃²_{t+1} = γ_{1,t} σ̂²_t + γ_{2,t} σ̂²_{t−1} + ⋯ + γ_{p,t} σ̂²_{t−p+1} + u_t

The Key Trade-Off: You must balance using lots of data for a precise estimate versus using recent data that is more relevant for today's market. After making forecasts, you evaluate them using metrics like MSE (Mean Squared Error) and MAE (Mean Absolute Error). A famous benchmark methodology is RiskMetrics.
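As a quick illustration, here is a hedged Python sketch of the recipe above, from log returns through annualization to a simple EWMA forecast. The price series and the smoothing parameter β are made-up values, and using the squared return as the per-period variance estimate σ̂²_t is an assumption.

```python
# A hedged sketch: historical volatility and a simple EWMA variance forecast.
import numpy as np

prices = np.array([100, 101, 99.5, 102, 103, 101, 104, 105], dtype=float)  # made up
R = np.diff(np.log(prices))              # Step 1: log returns
sigma_hat = R.std(ddof=1)                # Step 2: sample std dev (divides by T - 1)
vol_annual = np.sqrt(252) * sigma_hat    # Step 3: annualize a daily volatility

# Exponential moving average forecast of next period's variance.
beta = 0.94                              # assumed smoothing parameter (RiskMetrics-style)
var_tilde = R[0] ** 2                    # initialize with the first squared return (assumption)
for r in R[1:]:
    var_tilde = (1 - beta) * r ** 2 + beta * var_tilde

print(vol_annual, np.sqrt(var_tilde))
```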

🧠 Quiz Question: What is the main trade-off when deciding how much historical data to use for a volatility forecast?
Answer: It's a trade-off between precision and relevance.

Explanation: Using more data (a longer history) can give you a more statistically stable and precise estimate. However, using
only recent data ensures your forecast is more relevant to the current market environment, which might be very different from
the market 10 years ago.
2. Geometric Brownian Motion (GBM)
GBM is a foundational model for asset prices that assumes they move randomly but with a general trend. Think of it like a
remote-controlled car with a wonky wheel: it drives in a general direction, but it's constantly wobbling randomly.

The core equation is:

dS(t) = μS(t)dt + σS(t)dW (t)

dS(t) is the tiny, infinitesimal change in the security's price S(t).


μ is the mean return, or the "drift" that pushes the price in a direction.
σ is the volatility, the size of the random wobbles.
dW (t) is a tiny increment from a Wiener Process (or Brownian Motion), which represents the random shock. These shocks
are independent over time and are Gaussian (normally distributed).

Under GBM, the log-returns of the asset are normally distributed. The parameters μ and σ can be estimated from data using Maximum-Likelihood Estimation. For periods of equal length (Δ_j = 1), the estimates are:

μ̂* = R̄

σ̂² = (1/n) Σ_{t=1}^{n} (R_t − R̄)²
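Here is a minimal sketch of what this looks like in practice; the drift, volatility, and starting price are assumed values, and the path is simulated from the model's log-normal solution:

```python
# A minimal sketch: simulate one GBM price path and recover the per-period estimates.
import numpy as np

mu, sigma = 0.08, 0.2            # assumed annual drift and volatility
dt, n = 1 / 252, 252             # daily steps over one year
rng = np.random.default_rng(1)

# Log-return increments: (mu - sigma^2/2) dt + sigma * sqrt(dt) * Z
R = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
S = 100 * np.exp(np.cumsum(R))   # simulated price path starting at 100 (assumed)

mu_hat_star = R.mean()                               # mu-hat* = R-bar (per period)
sigma_hat = np.sqrt(((R - R.mean()) ** 2).mean())    # (1/n) sum (R_t - R-bar)^2 under the root
print(mu_hat_star, sigma_hat, sigma_hat * np.sqrt(252))  # last value: annualized volatility
```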

Garman-Klass Estimator: A Volatility Upgrade


The GBM model's biggest flaw is assuming constant volatility. But even within that world, we can get better at measuring that
volatility. The Garman-Klass estimator is a brilliant way to do this by using more than just the closing price; it uses the Open,
High, Low, and Close prices (OHLC) for a period.

Think of it like this: if you want to know how much the temperature changed in a day, just knowing the temperature at 5 PM
(the close) isn't the full story. Knowing the day's high and low gives you a much better picture of the temperature's volatility.

The coursework presents a series of increasingly sophisticated (and efficient) estimators. "Efficiency" here means how good
the estimate is. An efficiency of 7.4 means the estimator is 7.4 times better (less noisy) than the basic close-to-close
estimator!

Here are some of the estimators mentioned:

Close-to-Close: σ̂²₀ = (C₁ − C₀)². The simplest method, with a base efficiency of 1.

Parkinson (1976): Uses the high and low prices, achieving an efficiency of about 5.2.

σ̂²₃ = (H₁ − L₁)² / (4 log 2)

Garman and Klass (1980): Created a "Best Analytic Scale-Invariant Estimator" that combines the high, low, and close data relative to the open price. This estimator achieves an efficiency of about 7.4.

σ̂**² = 0.511(u − d)² − 0.019{c(u + d) − 2ud} − 0.383c²

The Composite GK Estimator: This is the final version that also includes the overnight jump from the previous day's close (C₀) to the current day's open (O₁). This powerful estimator reaches a final efficiency of about 8.4.

σ̂²_GK = a (O₁ − C₀)² / f + (1 − a) σ**² / (1 − f)
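To see how these range-based estimators might be used, here is a hedged Python sketch. The normalization u = log(H/O), d = log(L/O), c = log(C/O) is an assumption not spelled out in the coursework, while the 0.511/0.019/0.383 weights come from the text above.

```python
# A hedged sketch of the Parkinson and Garman-Klass single-period variance estimators.
import numpy as np

def parkinson(high, low):
    """Parkinson (1976) variance estimate from one period's high and low."""
    return (np.log(high / low) ** 2) / (4 * np.log(2))

def garman_klass(open_, high, low, close):
    """Garman-Klass (1980) scale-invariant variance estimate for one period."""
    u = np.log(high / open_)    # normalized high (assumed definition)
    d = np.log(low / open_)     # normalized low (assumed definition)
    c = np.log(close / open_)   # normalized close (assumed definition)
    return 0.511 * (u - d) ** 2 - 0.019 * (c * (u + d) - 2 * u * d) - 0.383 * c ** 2

# Made-up OHLC values for one trading day.
print(parkinson(103.0, 99.0), garman_klass(100.0, 103.0, 99.0, 101.0))
```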

3. Poisson Jump Diffusions (PJD)


GBM handles the normal, smooth wobbles. But markets sometimes have sudden, shocking jumps from unexpected news. The
PJD model adds a jump component to GBM to make it more realistic.

dS(t)/S(t) = μ dt + σ dW(t) + γσ Z(t) dΠ(t)
The new piece is the jump term, dΠ(t), which is an increment from a Poisson Process. This process models the occurrence of
a jump, while the Z(t) term models the magnitude of that jump. This allows the model to have both small, continuous changes
and large, sudden ones.

4. ARCH Models
The Autoregressive Conditional Heteroskedasticity (ARCH) model was a game-changer because it was designed for a
world where volatility is not constant.

The core idea is that today's volatility depends on the size of yesterday's surprises (or shocks). The model for the return y_t is:

y_t = μ_t + ϵ_t

Where the shock is ϵ_t = Z_t σ_t. The magic is in the variance equation:

σ²_t = α₀ + α₁ ϵ²_{t−1} + α₂ ϵ²_{t−2} + ⋯ + α_p ϵ²_{t−p}

This says that today's variance, σ²_t, is a weighted average of past squared shocks (ϵ²). This means the squared shocks follow an AR(p) process.

Finding and Fitting ARCH Models


Lagrange Multiplier (LM) Test: This is a statistical test to see if ARCH effects are present in your data. You test the hypothesis that all the α coefficients (except α₀) are zero. If the test statistic (nR²) is big, you reject this idea and conclude that an ARCH model is needed.

Maximum Likelihood Estimation (MLE): This is the statistical method used to find the best-fitting α coefficients. It works by writing down a "likelihood function" and finding the coefficients that make your observed data the most probable outcome. The ARCH model must satisfy constraints like α_i ≥ 0 and Σ α_i < 1 for it to be stable.
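As one possible way to run the LM test in practice, here is a hedged sketch using the het_arch function from statsmodels; the simulated series is just a stand-in for your demeaned returns or model residuals:

```python
# A hedged sketch of the LM test for ARCH effects with statsmodels.
import numpy as np
from statsmodels.stats.diagnostic import het_arch

rng = np.random.default_rng(0)
returns = rng.standard_normal(500) * 0.01   # stand-in for demeaned returns / residuals

lm_stat, lm_pvalue, f_stat, f_pvalue = het_arch(returns, nlags=5)
print(lm_stat, lm_pvalue)   # a small p-value would suggest ARCH effects are present
```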

5. GARCH Models
The Generalized ARCH (GARCH) model by Bollerslev (1986) is an extension of ARCH and is hugely popular in finance.

GARCH says today's variance depends on yesterday's shocks AND yesterday's variance. It has "memory." The widely used
GARCH(1,1) model is:
σ²_t = α₀ + α₁ ϵ²_{t−1} + β₁ σ²_{t−1}

α₁ ϵ²_{t−1} is the ARCH term (reaction to news).
β₁ σ²_{t−1} is the GARCH term (persistence of volatility).

This structure implies that the squared shocks, ϵ²_t, follow an ARMA(1,1) process. For the GARCH model to be stable (stationary), the coefficients must sum to less than 1: |α₁ + β₁| < 1.

Features and Properties of GARCH


GARCH is powerful because it explains real-world market behavior:

Volatility Clustering: A large shock or high variance yesterday leads to high variance today, creating clusters of high and
low volatility.
Heavy Tails: GARCH models naturally produce distributions with more extreme outcomes (higher kurtosis) than a simple
Gaussian distribution, which is true for real returns.
Mean Reversion: Volatility doesn't explode to infinity. It always tends to revert to its long-run average variance, which for
GARCH(1,1) is:
σ*² = α₀ / (1 − α₁ − β₁)

The model can be rewritten to explicitly show this mean-reverting behavior:

(ϵ²_t − σ*²) = (α₁ + β₁)(ϵ²_{t−1} − σ*²) + u_t − β₁ u_{t−1}
Model Selection and Other Variants
After fitting a GARCH model with MLE, you must check if it's a good fit by testing its residuals. To choose between different
models (e.g., GARCH(1,1) vs. GARCH(2,1)), you use criteria like AIC and BIC—the model with the lower score wins.
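As an illustration, here is a hedged sketch of fitting and comparing GARCH models with the third-party arch package (this is not the coursework's code, and the simulated returns are only a placeholder for real data):

```python
# A hedged sketch: fit GARCH(1,1) and GARCH(2,1) and compare them by AIC/BIC.
import numpy as np
from arch import arch_model   # assumes the `arch` package is installed (pip install arch)

rng = np.random.default_rng(0)
returns = rng.standard_normal(1000)   # stand-in for a series of (percent) returns

garch11 = arch_model(returns, mean="Zero", vol="GARCH", p=1, q=1).fit(disp="off")
garch21 = arch_model(returns, mean="Zero", vol="GARCH", p=2, q=1).fit(disp="off")

print(garch11.params)              # omega, alpha[1], beta[1]
print(garch11.aic, garch21.aic)    # the model with the lower AIC/BIC wins
```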

Many other advanced GARCH models exist, such as:

EGARCH (Nelson, 1992)


TGARCH (Glosten, Jagannathan, Runkler, 1993)
PGARCH (Ding, Engle, Granger)

🧠 Quiz Question: What is the key difference in the variance equation between an ARCH(1) model and a GARCH(1,1) model?
Answer: The GARCH(1,1) model adds a term for the previous period's variance, β₁ σ²_{t−1}.

Explanation: ARCH(1) models today's variance based only on yesterday's shock (ϵ²_{t−1}). GARCH(1,1) is more sophisticated because it models today's variance based on both yesterday's shock and yesterday's variance level, giving it a form of memory or persistence.

LESSON 2
Welcome to your first class on Bayesian Updating! 🎉 This might sound super advanced, but it's actually a really cool and
logical way to update our beliefs when we get new information. Think of yourself as a detective. You start with a guess (a
prior belief), then you find a clue (the data), and that clue helps you update your guess to be more accurate (a posterior
belief). Let's dive in!

Bayesian Updating with Discrete Priors 🚀


1. Learning Goals 🎯
Here’s what we're going to master today:

1. Use Bayes' Theorem: We'll learn how to use a special formula to figure out probabilities like a pro.
2. Know the Lingo: We'll get familiar with key terms like prior, likelihood, posterior, data, and hypothesis. You'll see how
each one plays a role in our detective work.
3. Use Update Tables: We'll learn how to organize all our work in a simple table that makes complex problems way easier to
solve.

2. A Quick Look at Bayes' Theorem Again 🧐


Remember Bayes' theorem? It's the super-powered formula that lets us flip conditional probabilities around. If we have a
Hypothesis (our guess) called H and some Data (our clue) called D, the formula looks like this:

P(H|D) = P(D|H) P(H) / P(D)

Let's break that down:

P(H|D): The probability that our hypothesis is true, given the data we saw. This is what we usually want to find! It's our updated belief, or posterior.
updated belief, or posterior.


P (D|H) : The probability of seeing that data, if our hypothesis was true. This is called the likelihood.
P (H) : The probability of our hypothesis being true before we saw any data. This is our starting guess, or prior.
P (D) : The overall probability of seeing that data, no matter what.

2.1 The Base Rate Fallacy 🤔


A common mistake is thinking that P (H|D) is the same as P (D|H). For example, just because most people with a rare disease
test positive, it doesn't mean that most people who test positive have the disease. We'll see why later on!

Quiz Question: Imagine your friend loves pizza. The probability that your friend is happy if they eat pizza is high. Does this
mean the probability they ate pizza if they are happy is also high?
Answer: Not necessarily!
Explanation: They could be happy for lots of other reasons, like acing a test or getting a new video game. This is the core
idea of Bayes' theorem – it helps us figure out the real probability instead of just jumping to conclusions!

3. New Words and Using a Table 📝


Let's use a fun example with coins to learn the new terms and see how to use a table to keep everything neat.

Example 1: I have a drawer with 5 special coins:

Type A: 2 fair coins (50% chance of heads)


Type B: 2 bent coins (60% chance of heads)
Type C: 1 super-bent coin (90% chance of heads)

I pick one coin at random, flip it, and it lands on HEADS. What's the probability it was a Type A, Type B, or Type C coin?

Let's break this down with our new detective lingo:

Experiment 🧪: Picking a coin and flipping it.


Data (Our Clue) 📊: We got Heads. We'll call this event D.
Hypotheses (Our Guesses) 🤔:
Hypothesis A: The coin is Type A.
Hypothesis B: The coin is Type B.
Hypothesis C: The coin is Type C.
Prior Probability (Starting Guesses) 🎲: Before we flipped the coin, what were the chances of picking each type? There
are 5 coins in total.
P (A) = 2/5 = 0.4 (40% chance)
P (B) = 2/5 = 0.4 (40% chance)
P (C) = 1/5 = 0.2 (20% chance)
Likelihood (Chance of the Clue) 💡: How likely were we to get Heads if we had a certain coin?
P (D|A) = 0.5 (The chance of heads with a Type A coin)
P (D|B) = 0.6 (The chance of heads with a Type B coin)
P (D|C) = 0.9 (The chance of heads with a Type C coin)
Posterior Probability (Updated Guesses) 📈: What we want to find! The probability of each coin type after we saw that
we got Heads.
P (A|D) , P (B|D), and P (C|D).

Here's a diagram of our situation:

Diagram: a probability tree. From Start, branch to Type A (P = 0.4), Type B (P = 0.4), and Type C (P = 0.2); from each coin type, branch to Heads/Tails with P(Heads) = 0.5, 0.6, 0.9 and P(Tails) = 0.5, 0.4, 0.1 respectively.

To use Bayes' formula, we first need the total probability of getting heads, P (D). We find this by adding up the chances of
getting heads from each coin type:

P (D) = (0.5 × 0.4) + (0.6 × 0.4) + (0.9 × 0.2) = 0.20 + 0.24 + 0.18 = 0.62
Now we can find our posterior probabilities!

P(A|D) = (0.5 × 0.4) / 0.62 = 0.20 / 0.62 ≈ 0.3226

P(B|D) = (0.6 × 0.4) / 0.62 = 0.24 / 0.62 ≈ 0.3871

P(C|D) = (0.9 × 0.2) / 0.62 = 0.18 / 0.62 ≈ 0.2903

Instead of doing all that math separately, let's use a Bayesian Update Table. It makes life so much easier!

Hypothesis (H) Prior P(H) Likelihood P(D|H) Bayes Numerator P(D|H)P(H) Posterior P(H|D) (Numerator / Total)

A 0.4 0.5 0.4 × 0.5 = 0.20 0.20/0.62 ≈ 0.3226

B 0.4 0.6 0.4 × 0.6 = 0.24 0.24/0.62 ≈ 0.3871

C 0.2 0.9 0.2 × 0.9 = 0.18 0.18/0.62 ≈ 0.2903

Total 1 (No Sum) P(D) = 0.62 1

This whole process of starting with a prior and using data to get a posterior is called Bayesian Updating. You just updated
your beliefs with evidence! 🧠
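The whole table fits in a few lines of code. Here is a minimal Python sketch of the same calculation, using the numbers from the coin example above:

```python
# A minimal sketch of the Bayesian update table: priors and likelihoods in, posteriors out.
import numpy as np

priors = np.array([0.4, 0.4, 0.2])        # P(A), P(B), P(C)
likelihoods = np.array([0.5, 0.6, 0.9])   # P(Heads | A), P(Heads | B), P(Heads | C)

numerators = priors * likelihoods          # Bayes numerator column
p_data = numerators.sum()                  # P(D), the column total
posteriors = numerators / p_data           # posterior column

print(p_data)       # 0.62
print(posteriors)   # approximately [0.3226, 0.3871, 0.2903]
```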

3.1 Important Things to Notice


1. Two Kinds of Probabilities: Notice we have probabilities about the data (like the chance of heads) and probabilities about
our hypotheses (like the chance the coin is Type A).
2. Beliefs Change: After seeing 'Heads', our belief shifted. Type B is now the most likely culprit, even though it started tied
with Type A. The chance for Type C went up a lot because Type C coins are great at producing heads!
3. The Numerator is Key: The "Bayes Numerator" column is the most important part of the calculation. The final "Posterior"
column is just that column with each number divided by the total, so they all add up to 1.
4. The "Tug-of-War": The posterior is a tug-of-war between the prior and the likelihood. Type C had a low prior (0.2) but a
huge likelihood (0.9), so its posterior probability got a big boost.
5. MLE isn't Everything: The Maximum Likelihood Estimate (MLE) is the hypothesis with the highest likelihood. Here, that's
Type C (0.9). But it didn't "win" because its prior probability was low. The posterior probability gives us a more complete
picture.

So, we can write Bayes' theorem in a few ways:

P(hypothesis | data) = P(data | hypothesis) P(hypothesis) / P(data)

Or even more simply, as a proportion:

P (hypothesis|data) ∝ P (data|hypothesis)P (hypothesis)

The symbol ∝ just means "is proportional to." It tells us that the posterior is driven by the prior multiplied by the likelihood.

3.2 Let's Use Some Variables (PMFs)


To make things look more "mathy," we can use variables.

Let θ (theta) be the value for our hypothesis. For our coins, θ could be 0.5, 0.6, or 0.9.
Let x be our data. We'll say x = 1 for Heads and x = 0 for Tails.
The prior pmf p(θ) is just our table of prior probabilities.
The posterior pmf p(θ|x = 1) is our table of posterior probabilities after seeing one head.

Hypothesis θ Prior p(θ) Posterior p(θ | x = 1)

A 0.5 0.4 0.3226

B 0.6 0.4 0.3871


C 0.9 0.2 0.2903
Imagine these as bar charts. Before the flip, the bars for 0.5 and 0.6 were the same height. After seeing a head, the bar for 0.6
is the tallest, and the bar for 0.9 grew a lot!

Example 2: What if the flip was TAILS (x = 0)? Let's redo the table!
The probabilities of getting tails are:

Type A (θ = 0.5): P (x = 0|θ = 0.5) = 1 − 0.5 = 0.5


Type B (θ = 0.6): P (x = 0|θ = 0.6) = 1 − 0.6 = 0.4
Type C (θ = 0.9): P (x = 0|θ = 0.9) = 1 − 0.9 = 0.1

Let's plug those into a new update table:

Hypothesis (θ) Prior p(θ) Likelihood p(x = 0 | θ) Bayes Numerator p(x = 0 | θ)p(θ) Posterior p(θ | x = 0)

0.5 0.4 0.5 0.4 × 0.5 = 0.20 0.20/0.38 ≈ 0.5263

0.6 0.4 0.4 0.4 × 0.4 = 0.16 0.16/0.38 ≈ 0.4211

0.9 0.2 0.1 0.2 × 0.1 = 0.02 0.02/0.38 ≈ 0.0526

Total 1 (No Sum) 0.38 1

Wow! After getting tails, the fair coin (Type A) is now the most likely. The super-bent coin that loves heads (Type C) is now
extremely unlikely. Our beliefs updated correctly based on the new clue!

Quiz Question: In the "Tails" example, why did the posterior probability for Type C (θ = 0.9) drop so much?
Answer: Because Type C coins are really good at getting heads (90% chance).
Explanation: Seeing a "Tails" is very surprising if the coin is Type C. This surprising data provides strong evidence against the
hypothesis that the coin was Type C, so its probability plummets.

4. Updating Again and Again 🔄


What if we get more clues? Easy! The posterior from our first update becomes the new prior for our second update.

Example 3: Let's go back to our first example. We flipped the coin and got Heads. Now, we flip the same coin a second time
and get Heads again. What are the probabilities now?

We can just add more columns to our table! The "Bayes Numerator 1" is the result from our first flip. We multiply that by the
likelihood of getting a second head to get "Bayes Numerator 2".

Hypothesis (θ) Prior p(θ) Likelihood 1 (Head) Bayes Numerator 1 Likelihood 2 (Head) Bayes Numerator 2 Posterior 2 (After 2 Heads)

0.5 0.4 0.5 0.20 0.5 0.20 × 0.5 = 0.100 0.100/0.406 ≈ 0.2463

0.6 0.4 0.6 0.24 0.6 0.24 × 0.6 = 0.144 0.144/0.406 ≈ 0.3547

0.9 0.2 0.9 0.18 0.9 0.18 × 0.9 = 0.162 0.162/0.406 ≈ 0.3990

Total 1 0.406 1

After two heads in a row, the Type C coin has finally become the most likely! The repeated evidence for 'Heads' was so strong
it overcame Type C's low prior probability. Each new piece of data helps us zero in on the most likely hypothesis.

5. Appendix: The Base Rate Fallacy (Disease Test Example) 🏥


Let's look at that tricky idea from the beginning with a real-world example.

Example 4: A screening test for a rare disease is very good.


It correctly identifies 99% of people who have the disease (True Positive).
It incorrectly says 2% of healthy people have the disease (False Positive).

The disease is rare: only 0.5% of the population has it. If a random person tests positive, what is the probability they actually
have the disease?

Let's set up a table.

Hypotheses: Have Disease (H+), Don't Have Disease (H−).

Data: Test is Positive (T+).

Hypothesis (H) Prior P(H) Likelihood P(T+|H) Bayes Numerator Posterior P(H|T+)

H+ (Have Disease) 0.005 0.99 0.00495 0.00495/0.02485 ≈ 0.1992

H− (No Disease) 0.995 0.02 0.01990 0.01990/0.02485 ≈ 0.8008

Total 1 (No Sum) 0.02485 1

The result is shocking! Even with a positive test from a 99% accurate test, the chance you actually have the disease is only
about 20%!

Why? This is the base rate fallacy. The "base rate" (the prior probability of having the disease) is incredibly low (0.5%). There
are so many more healthy people than sick people that the small percentage of "false positives" from the healthy group
creates a bigger pool of positive tests than the "true positives" from the sick group.

Quiz Question: In the disease example, what is the "base rate"?


Answer: The prior probability of having the disease, which is 0.5%.
Explanation: The "base rate" is your starting point—the general prevalence of something in a population before you get any
new evidence. Ignoring this base rate is a common thinking error that Bayesian updating helps us avoid!

Welcome to the fascinating world of probabilistic prediction! 🎉 This is all about making smart, number-based guesses about
the future, from predicting the weather to the outcome of a video game. Let's learn how to update our predictions as we get
new information!

Bayesian Updating: Making Smarter Predictions


1. Learning Goals 🎯
Our main goal is to learn how to make predictions before we have any data (called prior predictive probability) and how to
update those predictions after we get new information (called posterior predictive probability). Think of it as making an initial
guess and then refining it as you learn more!

2. Introduction 🚀
In our last lesson, we saw how to update our beliefs about something (like which type of coin we're holding) based on new
evidence. Now, we're taking it a step further: we'll use that new evidence to predict what might happen next!

2.1 Probabilistic Prediction


Imagine you're talking about a big soccer match. There are a few ways you could predict the outcome:

Simple Prediction: "Team A will win." (A simple, bold statement.)


Words of Estimative Probability (WEP): "It is likely that Team A will win." (This adds a level of certainty but is still a bit
vague. What does "likely" mean?)
Probabilistic Prediction: "There is a 60% chance that Team A will win." (This is super clear and uses numbers, which is
what we'll focus on!)

Using precise probabilities helps us understand the world better. It's used everywhere:

🌦️ Weather Forecasting: "There's a 90% chance of rain."


🩺 Medical Outcomes: "This treatment has an 80% success rate."
🗳️ Elections: "Candidate X has a 55% chance of winning."
⚽ Sports Betting: "This team has a 70% chance of scoring the next goal."
This method avoids confusing "weasel words" like "it might happen" or "it's conceivable," which don't really tell us much.

Quiz Question: Which of the following is a probabilistic prediction?


a) "I might go to the movies tonight."
b) "Our team will probably win."
c) "There is a 25% chance of our team winning."

Answer: c) "There is a 25% chance of our team winning."


Explanation: This is the only option that uses a specific number (a probability) to make a prediction.

3. Predictive Probabilities
Probabilistic prediction is all about assigning a specific probability to every possible outcome. Let's use a fun example from
our last class to see how it works.

Imagine a drawer with 4 special coins:

2 Type A coins: Perfectly fair, with a 50% chance of heads (P(Heads|A) = 0.5).
1 Type B coin: Slightly biased, with a 60% chance of heads (P(Heads|B) = 0.6).
1 Type C coin: Very biased, with a 90% chance of heads (P(Heads|C) = 0.9).

You reach in and pick one coin at random. What's the chance you grabbed each type?

P(A) = 2 Type A coins / 4 total coins = 0.5

P(B) = 1 Type B coin / 4 total coins = 0.25

P(C) = 1 Type C coin / 4 total coins = 0.25

These are our prior probabilities—our beliefs before we've flipped the coin.

3.1 Prior Predictive Probabilities


Before we flip the coin, what's our best guess for the probability of getting heads? This is our prior predictive probability. To
figure it out, we'll use the Law of Total Probability.

Think of it like this: your overall chance of getting heads is a weighted average of the chances for each coin type. The
"weight" is the probability of having picked that coin type in the first place.

Diagram: a two-stage tree. First pick a coin: P(A) = 0.5, P(B) = 0.25, P(C) = 0.25. Then flip it: P(H|A) = 0.5, P(H|B) = 0.6, P(H|C) = 0.9, and P(T|A) = 0.5, P(T|B) = 0.4, P(T|C) = 0.1.

Here's the formula:

P (D H ) = P (D H |A)P (A) + P (D H |B)P (B) + P (D H |C)P (C)

Let's break it down:

P(D_H): The total probability of getting heads on the first flip.
P(D_H|A): The probability of heads if you have coin A (which is 0.5).
P(A): The probability you picked coin A (which is 0.5).

Plugging in the numbers:

P (D H ) = (0.5 × 0.5) + (0.6 × 0.25) + (0.9 × 0.25)

P (D H ) = 0.25 + 0.15 + 0.225 = 0.625

So, before flipping, we predict a 62.5% chance of getting heads.

Similarly, for tails (D_T):

P (D T ) = (0.5 × 0.5) + (0.4 × 0.25) + (0.1 × 0.25)

P (D T ) = 0.25 + 0.10 + 0.025 = 0.375

Notice that 0.625 + 0.375 = 1, which makes perfect sense!

Quiz Question: What does "prior predictive probability" mean?


a) The probability of a hypothesis after seeing data.
b) The probability of an outcome before collecting any data.
c) The probability of an outcome after collecting data.

Answer: b) The probability of an outcome before collecting any data.


Explanation: "Prior" means before, and "predictive" means we are predicting an outcome (like heads or tails).

3.2 Posterior Predictive Probabilities


Now for the exciting part! Let's say we flip the coin and get heads. This is new data (D)! This new information should change
our beliefs about which coin we're holding. We use a Bayes table to update our probabilities.

Hypothesis (Coin Type) Prior P(H) Likelihood P(D|H) Bayes Numerator P(D|H)P(H) Posterior P(H|D)

A 0.50 0.5 0.250 0.250 / 0.625 = 0.40

B 0.25 0.6 0.150 0.150 / 0.625 = 0.24


C 0.25 0.9 0.225 0.225 / 0.625 = 0.36

Total 1 0.625 1

Our new, updated beliefs (posterior probabilities) are:

P (A|D) = 0.4 (The chance it's coin A has gone down)


P (B|D) = 0.24 (Slightly down)
P (C|D) = 0.36 (The chance it's coin C has gone up, because C is great at making heads!)

Now, we want to predict the outcome of a second flip. This is the posterior predictive probability. We do the exact same
calculation as before, but we use our new posterior probabilities as the weights!

Let's predict heads on the second flip, given we got heads on the first (D_H2 | D_H1):

P(D_H2 | D_H1) = P(D_H|A) P(A|D_H1) + P(D_H|B) P(B|D_H1) + P(D_H|C) P(C|D_H1)

P(D_H2 | D_H1) = (0.5 × 0.40) + (0.6 × 0.24) + (0.9 × 0.36)

P(D_H2 | D_H1) = 0.20 + 0.144 + 0.324 = 0.668

Our prediction for heads on the next toss has increased from 62.5% to 66.8%! This is because the first heads made us
suspect we have a coin that's good at landing heads (like Type C), so we adjust our future predictions accordingly.
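Here is a minimal Python sketch of both predictive calculations for the 4-coin drawer, matching the numbers above:

```python
# A minimal sketch: prior predictive for the first flip, posterior predictive for the second.
import numpy as np

priors = np.array([0.50, 0.25, 0.25])   # P(A), P(B), P(C)
p_heads = np.array([0.5, 0.6, 0.9])     # P(Heads | coin type)

prior_predictive = (priors * p_heads).sum()           # 0.625
posteriors = priors * p_heads / prior_predictive      # beliefs after seeing one Heads
posterior_predictive = (posteriors * p_heads).sum()   # 0.668

print(prior_predictive, posteriors, posterior_predictive)
```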

Quiz Question: Why did the probability of getting heads on the second toss increase after we saw heads on the first toss?
a) Because the coin physically changed.
b) Because the first result made us more confident, which affects the coin.
c) Because the first result updated our belief, making us think it's more likely we have a coin that favors heads.

Answer: c) Because the first result updated our belief, making us think it's more likely we have a coin that favors heads.
Explanation: The coin itself doesn't change, but our knowledge about it does! The data (getting heads) points towards the
coin being type B or C, which have a higher probability of landing heads.

3.3 Review 📝
Let's break down the key terms to avoid confusion.

Prior/Posterior Probabilities are for hypotheses.


Analogy: This is your belief about which player is taking the shot (e.g., "I'm 50% sure it's the star striker").
Prior Predictive/Posterior Predictive Probabilities are for outcomes.
Analogy: This is your prediction of whether they will score (e.g., "There's a 60% chance of a goal on this shot").

Essentially, you're using your belief about the hypothesis (which coin is it?) to predict the outcome (will it be heads?). As you
get data, your belief about the hypothesis changes, and so does your prediction for the next outcome!

Welcome back! 🧠 Today, we're taking a giant leap in our ability to make predictions. Before, we dealt with a handful of
choices, like guessing if a die has 4, 6, or 8 sides. Now, we're going to learn how to handle a smooth, continuous range of
possibilities, like a bent coin where the chance of heads could be any number between 0 and 1!

Bayesian Updating with Continuous Priors


1. Learning Goals 🎯
By the end of this lesson, you'll be a pro at:

1. Understanding a Continuous Range of Hypotheses: We'll see how a "family" of possibilities (like a million slightly
different bent coins) can be our set of hypotheses.
2. Using Bayes' Theorem for Continuous Ideas: You'll learn how to use the same powerful Bayes' theorem, but this time
with smooth graphs (PDFs) and integrals instead of simple sums.
3. Updating from a Prior to a Posterior PDF: We'll practice starting with an initial guess (a prior probability density function)
and using new evidence to create a refined, smarter guess (a posterior PDF).
4. Making Predictions with New Knowledge: You'll be able to use your updated guess to predict what's likely to happen
next!

2. Introduction 🚀
So far, our Bayesian updating has been for a limited, countable number of hypotheses. Now, we're upgrading to handle
situations with an infinite, continuous range of possibilities.

The best part? The core logic is exactly the same. We're just swapping out our tools:

Probability Mass Functions (PMFs) become Probability Density Functions (PDFs).


Sums ( Σ ) become integrals ( ∫ ).

Think of it like upgrading from a pixelated image to a high-resolution photo. The underlying picture is the same, but now we
can see all the smooth details in between!

3. Examples with Continuous Ranges of Hypotheses


Let's look at where this shows up in the real world.

Example 1: The Bent Coin 🪙


You have a bent coin, and you want to know its probability p of landing heads. This p isn't just 0.5 or 0.6; it could be
0.5381, 0.72, or any number between 0 and 1. This continuous range of possible values for p is our continuous set of
hypotheses.
Example 2: Isotope Lifetime ⚛️
Scientists model the decay of radioactive isotopes with something called an exponential distribution. The key parameter, λ
(lambda), determines the average lifetime. This average lifetime could be any positive number, giving us another
continuous range of hypotheses.
Example 3: Normal Distributions 📈
Imagine trying to model the height of every student in your country. This often follows a bell curve (a normal distribution)
with a mean (average height μ ) and a standard deviation (how spread out the heights are σ ). The values for μ and σ could
be any real numbers, creating a 2-dimensional continuous set of hypotheses.
In each case, we have a parametrized distribution—a sort of template model where each choice of the parameter(s) ( p , λ , μ ,
σ ) is a specific hypothesis.

Quiz Question: Which of the following is an example of a continuous range of hypotheses?


a) Guessing whether a die is 4, 6, or 8-sided.
b) Guessing the exact temperature outside, which could be 20.1°C, 20.11°C, etc.
c) Guessing whether a card drawn from a deck is a heart, spade, club, or diamond.

Answer: b) Guessing the exact temperature outside.


Explanation: Temperature can take on any value within a range, making it continuous. The other options involve a finite,
discrete number of choices.

4. Notational Conventions ✍️
To keep things neat, let's agree on some notation.

4.1 Parametrized Models


Instead of writing "the hypothesis that the probability of heads is 0.75," we'll just say "the hypothesis θ = 0.75 " or even just
"the hypothesis θ ." We use the Greek letter θ (theta) as a general-purpose symbol for whatever parameter we're interested
in.

4.2 Big and Little Letters ✍️


We have two ways of talking about probability, and they run in parallel:

Big Letters (Events): This is the conceptual level. We talk about the probability P of an Event A , like P(Coin is Type A) .
Little Letters (Values): This is the calculation level. We use a pmf p(x) or pdf f(x) for a specific value x .

Here's how they match up in the Bayesian world:

Concept Big Letter (Event) Little Letter (Value/Density) What it Means

Prior P (H) p(θ) or f (θ) Our initial belief about the hypothesis.

Likelihood P (D ∣ H) p(x ∣ θ) or f (x ∣ θ) The chance of seeing data x if hypothesis θ is true.

Total Prob. P (D) p(x) or f (x) The overall chance of seeing the data x .
Posterior P (H ∣ D) p(θ ∣ x) or f (θ ∣ x) Our updated belief after seeing the data.

From now on, we'll mostly use the little letters. We'll use Greek letters ( θ , λ , μ , σ ) for hypotheses and English letters ( x , y ) for
data.

5. Quick Review of PDF and Probability 📊


Remember, a Probability Density Function (PDF), f (x), is not a probability! It's a measure of probability density.

Analogy: Think of a map showing population density. The map tells you how crowded an area is (people per square mile), but
to find the actual number of people, you have to look at a specific area (an interval).

For a PDF, the probability that our variable X falls between c and d is the area under the curve from c to d .
P(c ≤ X ≤ d) = ∫_c^d f(x) dx

The tiny slice of probability in an infinitesimally small range dx around a point x is f (x)dx. The integral is just our way of
summing up all these tiny slices!

6. Continuous Priors, Discrete Likelihoods


Let's get practical. Often, our belief about a hypothesis is continuous (a PDF), but the data we collect is discrete (like
heads/tails).

Example 4: The Bent Coin with a Twist


We have a coin with an unknown probability θ of heads.
Our prior belief about θ is described by the PDF f (θ) = 2θ for θ between 0 and 1. This means we think it's more likely that
the coin is biased towards heads (e.g., f(0.9) is much bigger than f(0.1) ).
The likelihood of our data is discrete. If we get heads (x=1), the probability is θ . If we get tails (x=0), it's 1 - θ .

p(x = 1|θ) = θ, p(x = 0|θ) = 1 − θ

7. The Law of Total Probability


Remember the discrete version? To find the total probability of data D , we summed up the probabilities from each hypothesis:

P (D) = ∑ P (D|H i )P (H i )

For a continuous range, we do the same thing but replace the sum with an integral!

Theorem: Law of Total Probability (Continuous)


The total probability of observing discrete data x, given a continuous hypothesis θ with prior PDF f(θ), is:

p(x) = ∫_a^b p(x|θ) f(θ) dθ

This is also called the prior predictive probability of x .

Example 5: Predicting the First Flip


Let's use our coin from Example 4 where f (θ) = 2θ. What's the probability of getting heads on the first flip?

We integrate the likelihood of heads ( θ ) times the prior for θ ( 2θ ) over all possible values of θ (from 0 to 1).
p(x = 1) = ∫₀¹ p(x = 1|θ) f(θ) dθ = ∫₀¹ θ · 2θ dθ

p(x = 1) = ∫₀¹ 2θ² dθ = [2θ³/3]₀¹ = 2/3 − 0 = 2/3

Our prior belief leaned towards heads-biased coins, so it makes sense our overall prediction for heads is high (2/3)!

Quiz Question: Why is the law of total probability for continuous variables an integral instead of a sum?
a) Integrals are more accurate than sums.
b) Because there are infinitely many hypotheses in a continuous range, and an integral is like an infinite sum over a continuous
range.
c) Because the data is continuous.

Answer: b) Because there are infinitely many hypotheses in a continuous range, and an integral is like an infinite sum over a
continuous range.
Explanation: An integral is the perfect tool from calculus for summing up contributions over a smooth, continuous interval.

8. Bayes' Theorem for Continuous Probability Densities


Just like the law of total probability, Bayes' theorem makes a smooth transition from discrete to continuous.

Theorem: Bayes' Theorem (Continuous)

f(θ|x) dθ = p(x|θ) f(θ) dθ / p(x) = p(x|θ) f(θ) dθ / ∫_a^b p(x|θ) f(θ) dθ

This looks complex, but it's the same old recipe:

Posterior = (Likelihood × Prior) / Total Probability

Often, we drop the dθ from both sides to write it in terms of densities:

f(θ|x) = p(x|θ) f(θ) / p(x)

9. Bayesian Updating with Continuous Priors


Let's put it all together. We can use a simplified table to update our continuous prior. Since there's an infinite number of
hypotheses, we just use one row with θ as a variable.
Example 6: Updating After Three Flips
We're still using our coin with prior f(θ) = 2θ. We flip it three times and get the sequence HTT (x). What is our new, posterior PDF for θ?

The likelihood of getting HTT is p(x|θ) = θ · (1 − θ) · (1 − θ) = θ(1 − θ)². Wait, the coursework says HHT, which is θ²(1 − θ). Let's follow the coursework example.

Data: HHT, so p(x|θ) = θ²(1 − θ).

Hypothesis Range Prior Likelihood Bayes Numerator Posterior (Numerator / Total)

θ [0,1] 2θ dθ θ²(1 − θ) 2θ · θ²(1 − θ) dθ = 2θ³(1 − θ) dθ 20θ³(1 − θ) dθ

Total 1 (No Sum) p(x) = ∫₀¹ 2θ³(1 − θ) dθ = 1/10 1

Our posterior PDF is f(θ|x) = 20θ³(1 − θ). We started with a guess (2θ) and after seeing HHT, we have a new, more informed guess!

9.1 Flat Priors


What if we have no idea what θ could be? We can use a flat prior (or uniform prior). For θ between 0 and 1, a flat prior is just
f (θ) = 1. This means we initially believe all values of θ are equally likely.

Example 7: Updating a Flat Prior


Let's start with a flat prior, f (θ) = 1. We flip the coin once and get Heads.

Hypothesis Range Prior Likelihood Bayes Numerator Posterior

θ [0,1] 1 ⋅ dθ θ θdθ 2θdθ

Total 1 (No Sum) p(x = 1) = ∫₀¹ θ dθ = 1/2 1

The posterior PDF is f (θ|x = 1) = 2θ. We started with a "don't know" belief, and after one heads, our belief shifted to favor
higher values of θ .

9.2 Using the Posterior PDF


Example 8: How Much Did Our Belief Change?
Using the flat prior from Example 7:

Before seeing any data, what was the probability the coin was biased towards heads ( θ > 0.5 )?
P(θ > 0.5) = ∫_{0.5}^{1} f(θ) dθ = ∫_{0.5}^{1} 1 dθ = [θ]_{0.5}^{1} = 1 − 0.5 = 1/2

A 50% chance, as we'd expect from a "don't know" prior.


After seeing one heads, what's the new probability it's biased towards heads?
P(θ > 0.5 | x = 1) = ∫_{0.5}^{1} f(θ|x = 1) dθ = ∫_{0.5}^{1} 2θ dθ = [θ²]_{0.5}^{1} = 1 − (0.5)² = 1 − 0.25 = 3/4

Our belief jumped from 50% to 75% after just one flip!

10. Predictive Probabilities


Now we can use our updated knowledge to predict the next flip. This is the posterior predictive probability.

Example 9: Predicting the Second Flip


Let's go back to our original problem: prior f (θ) = 2θ.

Prior predictive probability of heads (we found this in Ex 5): p(x₁ = 1) = 2/3.
Now, suppose the first flip was heads (x₁ = 1). We need the posterior PDF. The coursework has a calculation error here. Let's fix it.
Prior: f(θ) = 2θ
Likelihood: p(x₁ = 1|θ) = θ
Bayes Numerator: 2θ · θ = 2θ²
Total Probability: p(x₁ = 1) = ∫₀¹ 2θ² dθ = 2/3
Posterior PDF: f(θ|x₁ = 1) = 2θ² / (2/3) = 3θ². This matches the text!

Now, what's the posterior predictive probability of heads on the second flip? We use our new posterior PDF:

p(x₂ = 1 | x₁ = 1) = ∫₀¹ p(x₂ = 1|θ) f(θ|x₁ = 1) dθ

= ∫₀¹ θ · (3θ²) dθ = ∫₀¹ 3θ³ dθ = [3θ⁴/4]₀¹ = 3/4

Our prediction for heads increased from 2/3 (66.7%) to 3/4 (75%) after seeing one heads.

11. (Optional) From Discrete to Continuous Bayesian Updating


How do we get from blocky bar charts (discrete pmfs) to smooth curves (continuous pdfs)? By taking smaller and smaller
steps! This is the core idea of calculus, called a Riemann sum.

Imagine we approximate a flat prior by saying θ can only be one of 4 values: 1/8, 3/8, 5/8, 7/8 . We give each a prior
probability of 1/4.

If we get one heads, we can make a standard discrete update table. The result is a bar chart for our posterior belief.

Now, what if we use 8 values? Or 20? Or 1000?

As we slice the continuous range into more and more tiny pieces, our blocky histograms for the prior and posterior get
smoother and smoother. Eventually, in the limit, they converge to the smooth PDF curves we've been calculating with
integrals! The discrete sum in the update table becomes the integral for the total probability.
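
Here's a small sketch of that idea in Python (the grid sizes are just illustrative): we approximate the flat prior with n equally likely values of θ, do the standard discrete update after one heads, and watch the rescaled bars line up with the continuous posterior 2θ.

PYTHON
import numpy as np

def discrete_update(n):
    theta = (np.arange(n) + 0.5) / n        # n midpoints, e.g. 1/8, 3/8, 5/8, 7/8 when n = 4
    prior = np.full(n, 1.0 / n)             # flat prior: each value equally likely
    numerator = prior * theta               # Bayes numerator: prior * P(heads | theta)
    return theta, numerator / numerator.sum()

theta4, post4 = discrete_update(4)
print(post4)                                # [0.0625 0.1875 0.3125 0.4375]

# Dividing each bar's probability by its slice width (i.e. multiplying by n)
# traces out the continuous posterior f(theta | x = 1) = 2*theta.
theta_big, post_big = discrete_update(1000)
print(np.allclose(post_big * 1000, 2 * theta_big))   # True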

Required Reading
Hello! Let's dive into one of the biggest ideas in statistics: the Bayesian approach. Think of it as a different philosophy for how
to think about evidence and uncertainty. It's like being a detective with a powerful new way to solve cases! 🕵️‍♂️

An Introduction to the Bayesian Approach


At its heart, the Bayesian method has two key ingredients you mix together:

1. A Parametric Model (f(V_T|θ)): This is your theory about how the world works. It's a rule that says, "If the secret parameter θ has a certain value, here's the probability of seeing the data V_T."

Analogy: This is your "theory of the crime." For example, "If the suspect is left-handed (θ), this is the likely pattern of evidence (f(V_T|θ))."

2. A Prior Distribution (p(θ)): This is your belief, hunch, or suspicion about the secret parameter before you see any
evidence.

Analogy: This is your "initial list of suspects." You might have a prior belief that some suspects are more likely than
others based on past experience.

By combining these two pieces using Bayes' Rule, you get the Posterior Distribution (p(θ|V_T)). This is the magic part! It represents your updated belief about the parameter after you've seen the data.

The formula looks like this:

p(θ|V_T) = f(V_T|θ) p(θ) / p(V_T)

Let's translate that into simple English:

Updated Belief = (Likelihood of seeing the evidence given your theory) × (Your initial belief) / (Overall probability of seeing that evidence)

Once you have this updated belief (the posterior), you can make smart conclusions, like picking the most likely value for your
parameter or creating a "credible set"—a range where you're 95% sure the true parameter value lies.
Quiz Question: What is a "prior" in the Bayesian world?
a) The evidence you collect.
b) Your final conclusion after seeing the data.
c) Your initial belief or hunch about something before you see any evidence.

Answer: c) Your initial belief or hunch about something before you see any evidence.
Explanation: The prior is your starting point, which you then update with data to get your posterior (final) belief.

Differences between Bayesian and Frequentist Approaches


Bayesian and Frequentist are the two main "teams" in statistics. They think about probability in fundamentally different ways.

Analogy: The Archer 🏹

Imagine an archer trying to hit a bullseye ( θ ) that is fixed but hidden behind a screen.

The Frequentist Archer:


Belief: The bullseye's position ( θ ) is fixed and unknown. Probability comes from the randomness of shooting arrows
(sampling).
Action: They shoot 1,000 arrows (collect many samples). They then draw a circle around 95% of their arrow holes.
Conclusion: They say, "I am 95% confident that the method I used to draw this circle will capture the true bullseye in
the long run." They cannot say there is a 95% probability the bullseye is in that specific circle. For them, the bullseye
is either in it or it's not—no probability involved for this one specific circle.
The Bayesian Archer:
Belief: The bullseye's position ( θ ) is uncertain, so we can use probability to describe our belief about where it is.
Action: They start with a hunch (a prior) about where the bullseye might be. They shoot just one arrow (the data).
Conclusion: Based on where that single arrow landed, they update their hunch and draw a circle. They say, "Given
my data, there is a 95% probability that the true bullseye is inside this circle."

Here's a quick summary:

| Feature | Frequentist Approach | Bayesian Approach |
|---|---|---|
| What is θ (the parameter)? | Fixed, but unknown. | A random variable we have beliefs about. |
| What is probability? | The long-run frequency of an event over many repeated trials. | A degree of belief or confidence about something. |
| What is fixed? | The parameter θ is fixed. | The data V is fixed (once you've seen it). |
| Confidence/Credible Sets | "95% confidence" is about the reliability of the method over many repeats. | A "95% credible set" is a direct probability statement about where θ likely is. |

Reasons to be Bayesian
Why might you choose the Bayesian way of thinking? Here are a few powerful reasons.

Reason 1 - It's Philosophical and Often More Intuitive 🧠


The Bayesian conclusion often feels more natural. Imagine a medical test that is "95% accurate." If you test positive, a
Bayesian can tell you the actual probability you have the disease. A frequentist can only talk about the long-run performance
of the test.

This is especially true in time series, like analyzing the stock market or inflation. We only have one history of the world! The
frequentist idea of "repeated samples" (imagining thousands of other worlds to get different stock market histories) can feel a
bit strange. The Bayesian approach of updating our beliefs as one single timeline unfolds often makes more sense.
Reason 2 - Convenient Shortcuts Exist (Conjugate Priors)
Sometimes, the math can be really clean. A conjugate prior is when your prior belief and the data's likelihood are from the
same "family" of distributions. When you combine them, the posterior belief is also in that same family!

Analogy: It's like mixing colors. If you start with a specific shade of blue paint (a Normal distribution prior) and mix it with data
that also behaves like a Normal distribution, you're guaranteed to get a perfect shade of green that is also a Normal
distribution (the posterior). This makes calculations much easier.
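
As a concrete illustration using SciPy (not part of the coursework): the coin example from earlier is exactly a conjugate pair. The prior f(θ) = 2θ is a Beta(2, 1) distribution, the coin flips form a binomial likelihood, and the posterior after HHT is Beta(4, 2), whose density is the 20θ³(1 − θ) we computed by hand.

PYTHON
from scipy import stats

a, b = 2, 1                    # Beta(2, 1) prior, density 2*theta
heads, tails = 2, 1            # observed data: HHT

# Conjugate update: just add the counts to the prior's parameters
posterior = stats.beta(a + heads, b + tails)   # Beta(4, 2)

theta = 0.7
print(posterior.pdf(theta))    # 20 * 0.7**3 * (1 - 0.7) = 2.058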

Reason 3 - With Enough Data, We All Agree 🤝


A fascinating result is that, in most "clean" situations, the influence of your prior vanishes as you get more and more data.

Analogy: Two detectives are on a case. Detective A has a strong hunch (prior) that the butler did it. Detective B has no idea. If
they collect a mountain of DNA evidence that points to the gardener, both detectives will end up agreeing that the gardener is
the culprit. The initial hunch becomes irrelevant in the face of overwhelming evidence.

This means that for large datasets, Bayesian and frequentist results often look very similar.

Reason 4 - It Handles "Nuisance Parameters" Gracefully ✨


Often, a model has many parameters, but you only care about one of them. The other parameters are a "nuisance."

Analogy: You want to find the average height of students ( θ₁ ). Your model also includes their average shoe size ( θ₂ ), which
you don't care about. The Bayesian framework has a very natural way to "integrate out" or average over all the possibilities for
shoe size, giving you a clean answer just for height. This can be extremely difficult in the frequentist world.

Reason 5 - It Can Be Easier to Implement 💻


For very complex modern models, it can be incredibly hard to find the single "best" parameter estimate in the frequentist way.
Bayesian methods, often using computer simulations (like MCMC), can explore the landscape of possibilities and find the
posterior distribution more easily.

Reason 6 - Priors Can Add Identification (A Risky Reason!) ⚠️


If your data isn't good enough to give a clear answer (the model is "not identified"), a frequentist might be stuck. A Bayesian's
prior belief can provide the extra information needed to get an answer. However, this is a double-edged sword: your final
answer might be driven more by your initial hunch than by the actual data!

Quiz Question: What happens to the Bayesian's prior belief when a very large amount of data is collected?
a) The prior becomes even more important.
b) The prior's influence washes out and becomes irrelevant.
c) The prior and the data become equally important.

Answer: b) The prior's influence washes out and becomes irrelevant.


Explanation: As you collect more data, the evidence starts to dominate your initial belief, and your posterior becomes almost
entirely shaped by the data.

Bayes Estimates, Tests, and Sets


Point Estimation: Picking the "Best" Value
Once you have your posterior distribution (your new, updated belief), you might want to pick a single number as your best
estimate. What's "best" depends on your goal:

Posterior Mean: The average value of your posterior distribution. This is the best choice if you want to minimize the
squared error of your guess.
Posterior Mode: The peak of your posterior distribution—the single most likely value.
Posterior Median: The middle value (50th percentile). This is a good choice if you want to avoid big errors in one direction
more than the other.
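
As a quick illustration (a SciPy-based sketch, not part of the coursework), here are those three point estimates for the coin posterior we derived earlier, f(θ|HHT) = 20θ³(1 − θ), which is a Beta(4, 2) distribution.

PYTHON
from scipy import stats

posterior = stats.beta(4, 2)     # f(theta | HHT) = 20 * theta^3 * (1 - theta)

post_mean = posterior.mean()     # minimizes expected squared error: 4/6 ≈ 0.667
post_median = posterior.median() # the 50th percentile: ≈ 0.686
post_mode = 3 / 4                # peak of the pdf; for Beta(a, b) it is (a-1)/(a+b-2)

print(post_mean, post_median, post_mode)
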
Testing Hypotheses
In the Bayesian world, you test ideas by comparing their posterior probabilities. For a hypothesis H₀ ("the parameter is 0") vs. H₁ ("the parameter is not 0"), you can calculate the posterior odds ratio:

Posterior Odds = P(H₀ | Data) / P(H₁ | Data)

If this ratio is large, you favor H₀. If it's small, you favor H₁. This approach treats both hypotheses more symmetrically than the frequentist p-value approach.

A tricky problem is testing a point null (e.g., H₀: θ = 0). With a continuous prior, the probability of any exact point is zero! The clever solution is to use a prior that puts a specific chunk of belief, say 20%, exactly on θ = 0 and spreads the other 80% over the other possibilities. This allows for a direct and sensible test.

Main Lesson 3
Hey there! 🚀 Let's dive into an exciting topic: Bayesian Estimation for GARCH models. It might sound complicated, but think
of it as being a data detective 🕵️‍♀️, using clues to update your guesses and uncover the truth. We'll break it all down step-by-
step. Let's get started!

BAYESIAN ESTIMATION FOR GARCH MODEL


Welcome to a new way of looking at statistics! In the last lesson, we learned about one way to estimate GARCH models called
Maximum Likelihood Estimation (MLE). Now, we're going to explore a cool alternative: Bayesian Estimation.

Imagine you're trying to guess the score of a video game match before it starts. The old way (called Frequentist statistics) is
to just look at past data. The new way (Bayesian statistics) lets you use your own intuition or prior knowledge (like "Team A
has a superstar player!") and then update your guess as the game unfolds. It's a powerful and flexible way to think about data,
and it's used everywhere from finance to AI.

Let's install the tools we'll need for our journey.

PYTHON
pip install arviz pymc pytensor

Now, let's import the libraries and load our dataset, which contains Google stock prices. We'll be looking at the daily returns of
the stock.

PYTHON
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
from pymc import GARCH11
from pytensor import shared
import pytensor
pytensor.config.mode = 'NUMBA'

PYTHON
# Download the dataset
m5_data = pd.read_csv("M5. goog_eur_10.csv")

# Convert date variable to date format
m5_data["Date2"] = pd.to_datetime(m5_data["Date"], format="%m/%d/%Y")
goog = m5_data.loc[:, ["Date2", "GOOGLE"]].set_index("Date2")

# Compute daily log returns of the Google price series
goog["GOOGLE_R"] = np.log(goog.GOOGLE).diff().dropna()

1. Bayesian Statistics
1.1 What is Bayesian Statistics?
So, what's the big deal with Bayesian statistics? It all comes down to one core idea.
Frequentist Statistics: This is what you've likely learned in school so far. It treats parameters like the mean (average) or
variance (spread) of data as fixed, unknown numbers. Our job is to estimate that one true number.
Bayesian Statistics: This approach is different. It treats parameters as random variables. This means instead of one
single "true" answer, a parameter can have a whole range of possible values, each with a certain probability.

Analogy: Guessing the Number of Gumballs 🍬

Imagine a giant jar of gumballs.

A frequentist would take a sample of gumballs, count them, and use that to make a single, best guess for the total
number in the jar.
A Bayesian, on the other hand, starts with an initial guess (maybe based on the size of the jar). This initial guess is called
a prior. Then, they take a sample of gumballs and use that new information to update their initial guess. The updated
guess, which is a range of likely values, is called a posterior.

This method is sometimes seen as more flexible, but it can also be criticized because the researcher's initial guess (the prior)
can influence the result. It's a big debate in the world of statistics!

Quiz Question: What is the main difference between how frequentist and Bayesian statistics view parameters (like the mean)?
Answer: Frequentist statistics sees parameters as single, fixed values that we try to estimate. Bayesian statistics sees them
as random variables that have a probability distribution.
Explanation: Think of it like finding a lost key. A frequentist believes the key is in one specific spot, and they're trying to find
that spot. A Bayesian thinks the key could be in several places and assigns a probability to each spot, updating those
probabilities as they search.

1.2 Basics of Bayesian Updating, Prior Probability Distributions, Posterior Probability


Distributions, and Likelihood Function
This is the heart of the Bayesian method! It’s all about updating your beliefs.

Prior Probability Distribution (Prior): This is your belief about a parameter before you see any data. It's your starting
point or your initial guess. For example, "I think there's a 60% chance my favorite team will win tonight."
Likelihood Function (Likelihood): This is the new information you get from your data. It tells you how likely it is to see this
data, given a certain value of the parameter. For example, "The data shows my team has won 8 of their last 10 games."
Posterior Probability Distribution (Posterior): This is your updated belief after you combine your prior with the likelihood
from the data. It's the grand finale!

The process of going from a prior to a posterior is called Bayesian updating.

Diagram:

🧠 Prior Belief (your initial guess) + 📊 Likelihood (evidence from data) → 🎯 Posterior Belief (your new, updated guess)

Quiz Question: If your "prior" is your initial guess, what is your "posterior"?
Answer: Your posterior is your updated guess after you've looked at the evidence or data.
Explanation: It's like starting with a hunch (prior), then gathering clues (likelihood), and finally forming a more educated
conclusion (posterior).

1.3 Sampling from Posterior Probability Distribution


Sometimes, the math for the posterior distribution is super complicated. It’s not a nice, simple equation we can just solve. So
what do we do? We cheat! 😉 We use a computer to draw thousands of random samples from it. By looking at all these
samples, we can get a really good picture of what the posterior distribution looks like.

The most popular way to do this is called Markov-Chain-Monte-Carlo (MCMC).

Analogy: Finding the Deepest Part of a Pool 🏊


Imagine you're blindfolded in a huge, weirdly shaped swimming pool, and your goal is to map out the deepest areas (these
deep spots represent the most probable values for our parameter).
Monte Carlo: This means you take random steps around the pool to explore it.
Markov-Chain: This means your next step depends on where you are right now. If you're in a deep spot, you're more
likely to stay nearby. If you're in a shallow spot, you might take a bigger step to find a deeper area.
By taking thousands of these steps (samples), you'll spend most of your time in the deepest parts, giving you a great
map of the posterior distribution!

The two most popular MCMC algorithms are the Metropolis-Hastings algorithm and the Gibbs Sampling algorithm. The
software we'll use employs Metropolis-Hastings.

A key thing to remember is that MCMC can be sensitive to where you start. If you start at the shallow end of the pool, your
first few steps might be a bit wild. That's why we sometimes use a good starting guess, like the one we got from the MLE
method in the last lesson.
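
To make the pool analogy concrete, here is a toy random-walk Metropolis-Hastings sampler (a minimal sketch, not the sampler PyMC actually uses below) for the coin example: prior f(θ) = 2θ and data HHT, so the unnormalized posterior is proportional to θ³(1 − θ).

PYTHON
import numpy as np

rng = np.random.default_rng(0)

def unnorm_posterior(theta):
    if theta <= 0 or theta >= 1:
        return 0.0                      # outside the pool: probability zero
    return theta**3 * (1 - theta)       # prior * likelihood (no normalizer needed)

theta = 0.5                             # starting point of the chain
samples = []
for _ in range(20000):
    proposal = theta + rng.normal(0, 0.1)                 # random step ("Monte Carlo")
    accept_prob = unnorm_posterior(proposal) / unnorm_posterior(theta)
    if rng.uniform() < accept_prob:                       # next step depends only on where we are now ("Markov chain")
        theta = proposal
    samples.append(theta)

samples = np.array(samples[250:])       # drop a short warm-up period
print(samples.mean())                   # close to the true posterior mean 4/6 ≈ 0.667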

Quiz Question: Why do we use methods like MCMC to sample from a posterior distribution?
Answer: Because the posterior distribution's formula is often too complex to work with directly. Sampling gives us an
approximate picture of it.
Explanation: It's like trying to understand a giant, complex painting. Instead of trying to describe the whole thing at once, you
take thousands of tiny photo samples from all over the canvas. By looking at all the samples, you can piece together what the
whole painting looks like.

2. Bayesian Estimation on GARCH Model


Alright, let's apply this detective work to our GARCH model! Remember, a GARCH model helps us predict volatility (how much
a stock's price will jump around). We'll focus on a GARCH model that uses a Student's t-distribution. Why? Because financial
data often has "fat tails," meaning extreme events (like a market crash) happen more often than a normal bell curve would
predict. The Student's t-distribution is great at capturing these surprises.

Here's the model's formula:

rₜ = ϵₜ √( ωₜ σₜ² (ν − 2) / ν )

And the famous GARCH variance part:

σₜ² = α₀ + α₁ rₜ₋₁² + β σₜ₋₁²

Let's break that down:

rₜ: The return of our stock on day t.
σₜ²: The variance (our measure of wobbliness or volatility) on day t.
ϵₜ: A random shock, like a surprise news event.
ωₜ: A special random variable that helps make our model follow a Student's t-distribution.
ν: Called the "degrees of freedom," this number controls how "fat" the tails of our distribution are. A smaller ν means fatter tails and more room for surprises!
α₀, α₁, β: These are the key GARCH parameters we need to estimate. They control how past volatility and past returns affect today's volatility.

To make sure our model works, we need some rules: α₀ > 0, α₁ ≥ 0, β ≥ 0, and ν > 2. We also need α₁ + β < 1 to ensure the volatility doesn't explode to infinity.

The big posterior formula we are trying to understand looks like this:

p(α₀, α₁, β, ν, ωₜ | rₜ) = l(rₜ | α₀, α₁, β, ν, ωₜ) · p(α₀, α₁, β, ν, ωₜ) / p(rₜ)

This just says: Posterior = (Likelihood × Prior) / Normalizing Constant.
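
Before estimating anything, it can help to see the model generate data. Below is a small simulation sketch with made-up parameter values; for simplicity it draws a standardized Student-t shock directly instead of going through the ωₜ mixture variable.

PYTHON
import numpy as np

rng = np.random.default_rng(42)

alpha0, alpha1, beta, nu = 0.1, 0.1, 0.85, 6.0   # illustrative values with alpha1 + beta < 1 and nu > 2
n = 1000

r = np.zeros(n)
sigma2 = np.zeros(n)
sigma2[0] = alpha0 / (1 - alpha1 - beta)          # start at the unconditional variance

for t in range(1, n):
    # Today's variance depends on yesterday's squared return and yesterday's variance
    sigma2[t] = alpha0 + alpha1 * r[t - 1]**2 + beta * sigma2[t - 1]
    shock = rng.standard_t(nu) * np.sqrt((nu - 2) / nu)   # unit-variance, fat-tailed shock
    r[t] = np.sqrt(sigma2[t]) * shock

print(r.std(), np.sqrt(alpha0 / (1 - alpha1 - beta)))     # sample vs. unconditional volatility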

Now, let's use the Python package PyMC to estimate the GARCH(1,1) model for Google's stock returns.

Round 1: The "Blank Canvas" Model


First, we'll try running the MCMC sampler with very generic starting values (close to zero). This is like starting our blindfolded
swimmer at a random spot in the pool.

PYTHON
# starting parameters = blank canvas model(0.000001, 0.000001, 0.000001)
alpha_mu = shared(np.array([0.000001, 0.000001], dtype=np.float64))
alpha_sigma = shared(np.array([[1000.0, 0.0], [0.0, 1000.0]], dtype=np.float64))
beta_mu = shared(np.array(0.000001, dtype=np.float64))
beta_sigma = shared(np.array(1000.0, dtype=np.float64))
ivolatility = shared(np.array(0.000001, dtype=np.float64))
ivolatility_vol = shared(np.array(10.0, dtype=np.float64))

# construct MCMC model
mcmc0 = pm.Model()
with mcmc0:
    # Vague joint prior for (alpha0, alpha1)
    mvn = pm.MvNormal("mvNormal", mu=alpha_mu, cov=alpha_sigma, shape=2)
    # Enforce positivity: draws that violate the constraint are mapped to -inf
    alp0 = pm.Deterministic("alpha0", pm.math.switch(mvn[0] > 0, mvn[0], -np.inf))
    alp1 = pm.Deterministic("alpha1", pm.math.switch(mvn[1] > 0, mvn[1], -np.inf))
    # Truncated-normal priors keep beta and the initial volatility non-negative
    nTruncated = pm.TruncatedNormal("beta", mu=beta_mu, sigma=beta_sigma, lower=0)
    volTruncated = pm.TruncatedNormal("volatility", mu=ivolatility, sigma=ivolatility_vol, lower=0)
    # GARCH(1,1) likelihood on the (percentage) Google returns
    likelihood = GARCH11("GARCH", omega=alp0, alpha_1=alp1, beta_1=nTruncated,
                         initial_vol=volTruncated,
                         observed=goog.GOOGLE_R.dropna() * 100)

We'll generate two sample series, called chains. Why two? It's like sending two blindfolded friends into the pool. If they both
come back with similar maps of the deep end, we can be more confident they found the right spot. This check is called
convergence diagnostics.

PYTHON
# Plot first round MCMC model posteriors
with mcmc0:
    trace_mcmc0 = pm.sample(3000, cores=2, step=pm.Slice(), tune=0,
                            return_inferencedata=True, random_seed=12345)
az.plot_trace(trace_mcmc0, var_names=["alpha0", "alpha1", "beta"],
              lines=[("alpha0", {}, [0.124993]), ("alpha1", {}, [0.082160]), ("beta", {}, [0.867127])],
              compact=False, legend=True, figsize=(16, 7))
plt.tight_layout()
plt.show()

First Round MCMC Sampling Result


The output of this code would be a set of plots.

Trace Plots (Right side of the plot): These show the journey of each chain for each parameter (α₀, α₁, β). You'd see two lines (one for each chain) zig-zagging. In this first run, the lines would look pretty wild at the beginning before settling down. This suggests the sampler took a while to find the "deep end" of the pool.
Density Plots (Left side of the plot): This is the final map! It shows a smooth curve representing the posterior distribution (our updated belief) for each parameter.

Now let's look at the summary statistics.

| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| alpha0 | 0.145094 | 0.053294 | 0.059351 | 0.262754 | 0.007721 | 0.004627 | 47.599936 | 51.832998 | 1.068832 |
| alpha1 | 0.089880 | 0.025170 | 0.048496 | 0.141094 | 0.003876 | 0.002370 | 47.796746 | 48.588985 | 1.066640 |
| beta | 0.852806 | 0.042358 | 0.757265 | 0.916286 | 0.006640 | 0.004178 | 44.814853 | 48.402712 | 1.074747 |

The most important column here is the last one: r_hat . This is the Gelman-Rubin statistic. It checks if our chains have
"converged" or agree with each other. A value close to 1.0 is great. A value above 1.1 or 1.2 is a red flag 🚩 that something's
wrong. Our values are around 1.07, which isn't perfect.

Round 2: Using Better Starting Values


The MCMC can be sensitive to its starting point. A bad start can lead to slow convergence. So, let's give it a hint! We'll use the
results from our old friend, Maximum Likelihood Estimation (MLE), as the starting values. This is like dropping our blindfolded
swimmers right into the middle of the pool.

PYTHON
# starting parameters = MLE(0.124993, 0.082160, 0.867127)
alpha_mu = shared(np.array([0.124993, 0.082160], dtype=np.float64))
beta_mu = shared(np.array(0.867127, dtype=np.float64))
ivolatility = shared(np.array(1.63865, dtype=np.float64))
# ... (rest of the model setup is similar)

After running the MCMC again with these new starting values, the trace plots would look much more stable from the very
beginning. Let's check the new r_hat values.

| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| alpha0 | 0.151440 | 0.091908 | 0.067844 | 0.240691 | 0.007156 | 0.043919 | 82.938764 | 221.227579 | 1.029334 |
| alpha1 | 0.092149 | 0.024706 | 0.054129 | 0.132585 | 0.002455 | 0.002861 | 92.183544 | 318.266329 | 1.024602 |
| beta | 0.848298 | 0.051242 | 0.782080 | 0.913977 | 0.004892 | 0.015678 | 75.874103 | 231.988360 | 1.030617 |

Much better! The r_hat values are closer to 1.0. This shows that a good starting point can really help.

Round 3: Adding a Burn-in Period


Even with a good start, the first few hundred steps of the MCMC chain can be a bit wonky as the sampler "warms up." It's
common practice to throw these early samples away. This is called the burn-in period.

Let's run the model one last time, telling it to discard the first 250 samples from each chain.

PYTHON
with mcmc:
    # ...
    trace_mmc_a = pm.sample(3000, cores=2, step=pm.Slice(),
                            tune=250,  # tune=250 means a 250-step burn-in
                            return_inferencedata=True, random_seed=12345)
# ... (plotting code)

The final trace plots should now look very stable and consistent, like a fuzzy caterpillar 🐛. The density plots will give us our
final answer for what the parameters likely are.

Let's check the final table.

| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| alpha0 | 0.143620 | 0.043415 | 0.074571 | 0.231585 | 0.004067 | 0.002347 | 108.231247 | 235.146465 | 1.008775 |
| alpha1 | 0.090170 | 0.019715 | 0.057861 | 0.128158 | 0.001783 | 0.001058 | 119.996201 | 301.232062 | 1.003836 |
| beta | 0.852987 | 0.032861 | 0.786890 | 0.907703 | 0.003265 | 0.001904 | 97.762053 | 211.792705 | 1.007301 |
Success! 🎉 All the r_hat values are very close to 1.0. This gives us confidence that our MCMC simulation worked well and our
results are reliable.

Final Comparison
How do our final Bayesian estimates compare to the MLE method from the last lesson?

| | MLE(norm) | MLE(st) | Bayesian |
|---|---|---|---|
| omega/alpha0 | 0.1250 | 0.0616 | 0.1436 |
| alpha1 | 0.0822 | 0.1039 | 0.0901 |
| beta | 0.8671 | 0.8865 | 0.8530 |

The results are quite similar! This is a good sign. Using the MLE results as starting values helped guide our Bayesian model to
a comparable and reliable solution.

Quiz Question: What is a "burn-in" period in MCMC, and why is it useful?


Answer: A burn-in period is the first set of samples from an MCMC chain that we decide to throw away.
Explanation: It's useful because the sampler might take a while to "warm up" and find the important regions of the posterior
distribution. Discarding these early, potentially unreliable samples gives us a better final result, just like a chef might discard
the first slightly-burnt pancake from the pan.

4. Conclusion
Wow, you made it! In this lesson, we took a deep dive into the world of Bayesian statistics.

We learned the difference between the frequentist and Bayesian approaches.


We explored the key concepts of priors, likelihoods, and posteriors.
We saw how MCMC sampling helps us map out complex posterior distributions.
Finally, we applied all this knowledge to estimate a GARCH model for Google's stock returns, learning how to use starting
values, check for convergence, and use a burn-in period to get solid results.

Bayesian statistics is an incredibly powerful tool that's becoming more and more important, especially in fields like machine
learning. You've taken a huge step in understanding how modern data detectives work! Keep exploring! 🌟

LESSON 4
Required Reading
Hello! 🕵️‍♂️ Get ready to become a data detective as we dive into State-Space Models and the super-cool Kalman Filter. Think
of it like trying to track a hidden submarine. You can't see the sub itself (that's the hidden state), but you get blips on your
sonar every few minutes (those are your measurements). Our goal is to use those blips to figure out exactly where the sub is
and where it's going next! 🚀

State-Space Models
In this lesson, we're looking at a powerful tool called state-space models. They're used a ton in economics, especially for
tracking things that we can't see directly.

Example 1: The Secret Trend in GDP Growth 📈


Imagine we're looking at how much a country's economy (its GDP) is growing. We can see the final growth number, let's call it yₜ, but it's actually made of two secret parts:

yₜ = μₜ + ϵₜ

μₜ = μₜ₋₁ + ηₜ

Let's decode this:

yₜ: The GDP growth we can actually see and measure.
μₜ: This is the hidden, slow-moving trend of the economy. Think of it as the economy's true underlying speed. This is our unobserved state.
ϵₜ: This is just random, unpredictable noise, like a sudden good or bad news report that causes a small blip.

The second equation, μₜ = μₜ₋₁ + ηₜ, shows how the hidden trend moves over time. It says today's trend is just yesterday's trend plus a little random nudge, ηₜ.

The big challenge is that we only see yₜ. We have to play detective to figure out the secret trend, μₜ!
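
Here's a tiny simulation sketch of this "local level" model (the noise sizes are made up). It makes the detective problem concrete, because the code generates both series but an analyst would only ever get to see yₜ.

PYTHON
import numpy as np

rng = np.random.default_rng(7)
n = 200
sigma_eps, sigma_eta = 1.0, 0.1     # measurement noise vs. trend noise (illustrative)

mu = np.zeros(n)                    # hidden trend
y = np.zeros(n)                     # what we actually observe
mu[0] = 2.0                         # starting trend, e.g. 2% growth
y[0] = mu[0] + rng.normal(0, sigma_eps)
for t in range(1, n):
    mu[t] = mu[t - 1] + rng.normal(0, sigma_eta)   # state equation: trend drifts slowly
    y[t] = mu[t] + rng.normal(0, sigma_eps)        # measurement equation: noisy observation

print(y[:5].round(2))               # we only ever get to see y, never mu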

Example 2: Are We in a Boom or a Bust? 🎢


Another cool example is figuring out if the economy is in a "boom" (growing fast) or a "bust" (not growing). Let's say the hidden state, Sₜ, can be either 1 (for a boom) or 0 (for a bust).

Our GDP growth, yₜ, depends on this hidden state:

yₜ = β₀ + β₁ Sₜ + ϵₜ

This just means that when we're in a boom (Sₜ = 1), the growth is higher. We also have probabilities that describe how we switch between states (e.g., the chance of going from a bust to a boom). Again, we only see the final GDP growth, yₜ, not the secret state Sₜ.

What's the Goal?


In both cases, we have something we can see (y ) that depends on something we can't see (the hidden state, which we'll call
t

α ). Our detective work involves:


t

1. Estimating Parameters: Figuring out the values of things like the amount of noise (σ and σ ). 2
ϵ
2
η

2. Extracting the State: Finding the value of the hidden state (like estimating the true trend μ ). t

3. Forecasting: Predicting what y will be in the future.


t

The Two Key Equations


Every state-space model has two main equations:

1. State Equation: Describes how the hidden state, αₜ, changes over time. It's the rulebook for the secret submarine's movement.
2. Measurement Equation: Describes how the clues we see, yₜ, are related to the hidden state. It explains how the sonar blips are created by the submarine.

Using these two equations, we can figure out the probability (or "likelihood") of seeing the data we've collected. This process of using the data to update our beliefs about the hidden state is called filtering.

The Three Steps of Filtering


Filtering is like a repeating cycle of predicting and updating.

1. Combine Clues: Figure out the probability of your next measurement based on all past clues.
2. Predict: Guess where the hidden state will be before you get your next clue.
3. Update: Use your new clue to update your guess about the hidden state.

Analogy: Finding a Friend in a Mall 🛍️

1. Predict: You know your friend was in the food court. You guess they're now heading towards the movie theater. (f(αₜ | Yₜ₋₁))
2. Get a Clue: Your friend texts you, "Just passed the shoe store!" (yₜ)
3. Update: You combine your prediction with this new clue. Now you have a much better idea of exactly where they are. (f(αₜ | Yₜ))

This process can be mathematically tricky because of some complex math called integration. But, there are two cases where
it's easy:

When the states are simple choices (like "boom" or "bust").


When everything follows a Normal Distribution (the classic "bell curve").

When we have normal distributions, we can use a super-powerful tool called the Kalman Filter!

Quiz Question: What are the two main equations in a state-space model, and what do they represent?
Answer: The State Equation and the Measurement Equation.
Explanation: The State Equation is the rulebook for how the hidden thing you're tracking moves (e.g., a submarine's path).
The Measurement Equation explains how the clues you see are related to that hidden thing (e.g., how the submarine creates
sonar blips).

Kalman Filtering
The Kalman Filter is the perfect tool for state-space models when all the random noise is "normal" (follows a bell curve).

Let's write down our spy-thriller equations:

1. State Equation: How the state evolves.

α t = T α t−1 + Rη t

2. Measurement Equation: How we get our clues.

y t = Zα t + Sξ t

Decoding the Spy Manual:

αₜ: The hidden state (the sub's true position at time t).
yₜ: The measurement we observe (the sonar blip at time t).
ηₜ and ξₜ: Random, unpredictable noise.
T, R, Z, S: These are just matrices (collections of numbers) that define the rules of the game. For example, T describes how the sub moves from one moment to the next.

Since everything is based on normal distributions, all our predictions and updates will also be normal distributions. A normal
distribution is defined by just two things: its mean (center) and its variance (spread). So, the Kalman filter is just a set of
recipes for calculating these means and variances!

The filter works in a loop:

1. Predict Step: It predicts the mean (α_{t|t−1}) and variance (P_{t|t−1}) of the state for the next time step, based on everything seen so far.
2. Update Step: When the new measurement (yₜ) arrives, it updates the prediction, giving us a more accurate new mean (α_{t|t}) and variance (P_{t|t}).

Diagram:

Updated Belief at t−1 (α_{t−1|t−1}, P_{t−1|t−1}) → Predict Step → Predicted Belief for t (α_{t|t−1}, P_{t|t−1}) → new measurement yₜ arrives → Update Step → Updated Belief at t (α_{t|t}, P_{t|t}) → loop to the next time step

This loop runs forward in time, getting better and better at tracking the hidden state with each new piece of information.

Quiz Question: The Kalman filter is a recipe for calculating what two numbers at each step?
Answer: The mean and the variance.
Explanation: Because we assume everything follows a normal (bell curve) distribution, we only need to keep track of the
center of the curve (the mean) and how wide it is (the variance). The Kalman filter just tells us how to update these two
numbers as new clues come in.

Kalman Smoother
The Kalman filter is great for real-time tracking, as it uses all information up to the present moment. But what if you have the
complete set of data for an entire year and want the absolute best estimate for the hidden state back in June?

For this, we use the Kalman Smoother. After the Kalman filter runs all the way to the end (forward in time), the smoother runs
backward from the end to the beginning. At each step, it revises the estimate using information from the future, making it
even more accurate.

Kalman Filter: Best guess using the PAST.


Kalman Smoother: Best guess using the PAST and the FUTURE.
Quiz Question: What is the key difference between the Kalman filter and the Kalman smoother?
Answer: The Kalman filter runs forward in time and uses past data to estimate the present state. The Kalman smoother runs
backward in time (after the filter is done) to refine those estimates using all data, including future data.
Explanation: Think of watching a mystery movie. The Kalman filter is your theory of "who did it" halfway through the movie.
The Kalman smoother is your final, much more accurate theory after you've seen the whole movie, including the ending.

Summary
For a state-space model, the Kalman filter is a superstar that can:

Calculate the likelihood of your data, which helps in estimating parameters.
Track a hidden state in real-time (α_{t|t}).
Forecast future measurements (y_{t+1|t}).
Work with a Kalman Smoother to give the most accurate "after-the-fact" picture of the hidden state (α_{t|T}).

What can be cast into state-space form?


Lots of famous time series models can be sneakily rewritten in the state-space form!

AR(p) models: Where the current value depends on past values.


MA(q) models: Where the current value depends on past random shocks.
ARMA(p,q) models: A mix of both.

For ARMA models, it's almost impossible to calculate the likelihood without the Kalman filter, making it a truly essential tool!
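
As a small illustration (NumPy only, coefficients made up), here is how an AR(2) model can be rewritten in the state-space form above: the state vector stacks the current and previous values, and simulating the state-space system reproduces the AR(2) exactly.

PYTHON
import numpy as np

# AR(2): y_t = 0.5*y_{t-1} + 0.3*y_{t-2} + eta_t, recast as
# alpha_t = T @ alpha_{t-1} + R * eta_t,  y_t = Z @ alpha_t
phi1, phi2 = 0.5, 0.3

T = np.array([[phi1, phi2],        # state transition: stacks (y_t, y_{t-1})
              [1.0,  0.0]])
R = np.array([[1.0],               # the shock only enters the first state element
              [0.0]])
Z = np.array([[1.0, 0.0]])         # measurement: we observe the first element exactly

rng = np.random.default_rng(3)
alpha = np.zeros((2, 1))           # state vector [y_t, y_{t-1}]'
for t in range(5):
    alpha = T @ alpha + R * rng.normal()
    y_t = (Z @ alpha).item()
    print(round(y_t, 3))           # identical to simulating the AR(2) directly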

What else can be done by Kalman filter? Error-in-variables


Imagine you want to study the "real" interest rate, but you can't observe it directly because you don't know what people
expect inflation to be. You only see the realized interest rate. This is an "error-in-variables" problem. We can set this up as a
state-space model where the hidden state is the true interest rate, and the measurement is the noisy realized rate we get to
see. The Kalman filter can then help us estimate the true, hidden rate!

Missing or unequally spaced Observations


What if you're a spy who only gets clues on Mondays and Wednesdays? What do you do about Tuesday? The Kalman filter can
handle this! If an observation is missing, the filter's "Update" step is simply skipped. The "Predict" step just makes a forecast
based on the last available data. It's an elegant way to fill in the gaps and handle real-world messy data.

Analogy: Tracking a Car with a Spotty GPS 🚗


Imagine you're tracking a car, but the GPS signal cuts out for a minute. The Kalman filter would use the car's last known speed
and direction (the Predict step) to estimate where it is during the blackout. When the signal comes back (a new
Measurement), it instantly corrects its estimate (the Update step). It does this automatically!

Quiz Question: You are using the Kalman filter to track a satellite, but a solar flare makes you lose the signal for an hour. What
does the filter do during that hour?
Answer: It keeps running its "Predict" step, using the satellite's last known trajectory to estimate its path. It temporarily skips
the "Update" step until a new signal is received.
Explanation: The filter doesn't just give up! It uses its internal model of how the satellite moves to coast through the missing
data period, making the best guess possible until it gets new information.

Main Lesson 4

GARCH Model Under State Space Model Construct


Welcome to a fascinating intersection of statistics and finance! In this lesson, we'll see how a powerful technique from
aerospace engineering, the State Space Model (SSM), can be used to build a GARCH model. We'll start with the basics of
SSM, introduce its most famous version called the Kalman Filter, and then combine everything to see how it all works
together.
1. State Space Models 🛰️
1.1. The Basics of a State Space Model
Imagine you're trying to figure out your friend's mood. You can't directly see their "mood," but you can see their text
messages.

The hidden, unobservable thing (their mood) is the state.


The things you can see (their texts) are the observations.

A State Space Model does exactly this! It helps us understand a hidden state by looking at the observations it produces. It
uses two main equations to do this.

Observation Equation: This equation connects the hidden state to what we can actually see.

yt = At xt + vt , v t ∼ W N (0, V t )

yt : The observation at time t (e.g., your friend's text message).


xt : The hidden state at time t (e.g., their actual mood).
At : A rule that links the state to the observation.
vt : Random measurement noise. It's like a typo or random emoji that doesn't perfectly reflect their mood.

State Equation: This equation describes how the hidden state changes over time.

x t = Φ t x t−1 + w t , w t ∼ W N (0, W t )

xt : The state today.


x t−1 : The state yesterday.
Φt : A rule that describes how yesterday's state turns into today's state.
wt: Random system noise. It's a random event that could change their mood unexpectedly (like finding $20 on the
street!).

Diagram:
Here's how the two parts are connected:

State at t=0 → (State Equation) → State at t=1 → (State Equation) → State at t=2 → ...
Each hidden state also produces what we see, via the Observation Equation: State at t=1 → Observation at t=1, State at t=2 → Observation at t=2, and so on.

A quick note on the rules (Aₜ and Φₜ):

If they can change over time (they have the little t), it's a time-varying model.
If they are fixed and don't change, it's a time-invariant model.
The Three Goals of SSM 🎯
Once we build a model, we want to use it for:

a. Predicting: Guessing the future state (What will my friend's mood be tomorrow?).
b. Filtering: Estimating the current state using all info up to today (What is my friend's mood right now based on the text
they just sent?).
c. Smoothing: Revising our estimate of a past state using info we got later (Looking back, what was their mood yesterday,
knowing what they texted today?).

Quiz Question: In a State Space Model, which variable can you directly see and measure: the "state" or the "observation"?
Answer: The observation.
Explanation: The observation (yₜ) is the data we can collect, like a stock price or a text message. The state (xₜ) is the hidden underlying factor, like market volatility or a person's mood, that we are trying to estimate.

1.2. Properties of State Space Model


SSMs have two key properties that make them special.

a. xₜ is a Markov Process

This is a fancy way of saying the state is memoryless. The future state only depends on the current state, not the entire
history of how it got there.

Analogy: In a video game, your character's next action depends on their current health and position, not every single
move you made since the start of the game.
In math terms: P (x t |x t−1 , x t−2 , ⋯ , x 0 ) = P (x t |x t−1 )

b. yₜ is fully specified given xₜ

This means that if you magically knew the exact hidden state at time t, you'd know everything you need to know about the
observation at time t (except for some random noise).

Analogy: If you knew your friend was ecstatically happy (the state), you could predict they would send a text full of happy
emojis (the observation).

These properties allow us to write a formula for the whole system:

P(x_{0:t}, y_{1:t}) = P(x₀) ∏_{i=1}^{t} P(xᵢ | xᵢ₋₁) P(yᵢ | xᵢ)

This just means we can figure out the probability of everything happening by combining the probability of the starting state,
how states change, and how observations are made.

Quiz Question: What does it mean for a process to be "memoryless"?


Answer: It means that the future only depends on the present moment, not the entire past.
Explanation: The Markov property makes calculations much simpler because we don't need to store and use the entire
history of data to make a prediction; we only need the most recent state.

1.3. Benefits of State Space Model (SSM)


Why are SSMs becoming so popular?

Less Data Needed: Because of the Markov "memoryless" property, we don't need huge amounts of historical data.
Flexible Rules: They allow coefficients to change over time, which is great for modeling real-world situations where things
aren't always constant (like before and after a big financial crisis).
Handles Missing Values: They can cleverly fill in the gaps if some of your data is missing.
Models Complex Systems: Great for modeling systems with many moving parts (multivariate systems).
Works with Non-Stationary Data: Can be used on time series that don't have a constant mean or variance.

1.4. Estimation of Parameters for State Space Model


To use an SSM, we need to estimate the values of its key parts: Aₜ, Φₜ, Vₜ, and Wₜ. The most popular methods are:
t t t t

Maximum Likelihood Estimation (MLE).


Bayesian Estimation: This is a very natural fit for SSMs. The whole idea of Bayesian statistics is "updating your beliefs
based on new evidence," which is exactly what an SSM does when it uses a new observation to update its estimate of the
hidden state.

2. Kalman Filter 🤖
The Kalman Filter is the most famous type of State Space Model. It's a linear model that assumes the random noise is normal
(Gaussian). It's incredibly popular because it's efficient and doesn't require tons of computing power.

2.1. Kalman Filter Model Setup


Here are the equations for a time-invariant Kalman Filter:

Observation Equation:

yt = F xt + vt , v t ∼ N (0, V )

State Equation:

x t = Gx t−1 + w t , w t ∼ N (0, W )

Let's list out all the players in this model:

xₜ: The vector of hidden states.
yₜ: The vector of observations.
F: The observation matrix that links states to observations.
G: The state transition matrix that describes how states evolve.
vₜ: The measurement noise, from a normal distribution.
wₜ: The system noise, also from a normal distribution.
V: The covariance matrix for the measurement noise.
W: The covariance matrix for the system noise.
m₀: The starting guess for the mean of the initial state, x₀.
C₀: The starting guess for the covariance of the initial state, x₀.

2.2. Kalman Filter Model Derivation


The Kalman Filter works like a cycle of predicting and updating, which is very similar to the Bayesian way of thinking.

We can express the "filtering" goal using the Bayes theorem:

P(xₜ | yₜ, Dₜ₋₁) = P(yₜ | xₜ) P(xₜ | Dₜ₋₁) / P(yₜ)

Posterior (P(xₜ | yₜ, Dₜ₋₁)): Our updated belief about the state, after seeing today's observation.
Likelihood (P(yₜ | xₜ)): How likely is the observation we saw, given a certain state?
Prior (P(xₜ | Dₜ₋₁)): Our prior belief about the state, before seeing today's observation.

This process gives us a set of results called the Kalman filter recursion. The most important result is how to find the mean (mₜ) and variance (Cₜ) of the state today.

The key formula for the estimated state is:

mt = at + Kt et

Let's break that down with an analogy:

mₜ: Your new best guess for the state.
aₜ: Your forecast of the state, based on yesterday's info.
eₜ: The surprise! This is the difference between what you actually saw (yₜ) and what you forecasted (fₜ).
Kₜ: The Kalman Gain. This is a special factor that decides how much you should react to the "surprise." If the Kalman Gain is high, you trust the new observation a lot. If it's low, you stick more closely to your original forecast.

This is a recursive method—it only needs the result from the last period (t − 1) and the new data from today (t) to work. No
need to re-process all the old data!
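
Here is a minimal scalar sketch of that recursion (made-up numbers, local-level model), using the same symbols: aₜ and Rₜ are the predicted state mean and variance, fₜ and Qₜ the predicted observation mean and variance, eₜ the surprise, Kₜ the Kalman gain, and mₜ, Cₜ the updated beliefs.

PYTHON
import numpy as np

G, F = 1.0, 1.0          # random-walk state, direct (noisy) observation
W, V = 0.1, 1.0          # system and measurement noise variances (illustrative)

m, C = 0.0, 1.0          # starting beliefs m_0, C_0
observations = [1.2, 0.9, 1.5, 1.1]

for y in observations:
    # Predict step
    a = G * m                    # a_t: forecast of the state
    R = G * C * G + W            # R_t: its uncertainty
    # Forecast of the observation
    f = F * a
    Q = F * R * F + V
    # Update step
    e = y - f                    # e_t: the "surprise"
    K = R * F / Q                # K_t: Kalman gain
    m = a + K * e                # m_t = a_t + K_t * e_t
    C = R - K * F * R            # C_t: updated uncertainty
    print(round(m, 3), round(C, 3))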

Quiz Question: What does the Kalman Gain (Kₜ) do in the Kalman Filter?

Answer: It acts as a tuning factor to decide how much weight to put on the new observation versus the previous forecast.
Explanation: A high Kalman Gain means the new observation is very influential, causing a big update to the state estimate. A
low gain means the new observation has less impact.

2.3. Forecasting with Kalman Filter


The Kalman filter is amazing for forecasting. We can predict the states and observations for any future time step (t + d).

For one step ahead (d = 1):

Forecast State (x_{t+1}): We use the transition matrix G on our current best estimate mₜ.
Forecast Observation (y_{t+1}): We use the observation matrix F on our forecast state x_{t+1}.

For more than one step ahead (d > 1), we just repeat the process recursively!

3. Use Kalman Filter to Fit GARCH(1,1) Model 📈


Now for the main event: combining the Kalman Filter with GARCH! The goal is to let the noise in our model have changing
volatility, which is perfect for financial data.

Here's the combined model:

yt = F xt + zt

x t = Gx t−1 + w t , where w t ∼ N (0, W )

z t = s t ϵ t , where ϵ t ∼ N (0, 1)

sₜ² = α₀ + α₁ zₜ₋₁² + β₁ sₜ₋₁²

The first two equations are our standard Kalman Filter.


The third and fourth equations are our GARCH(1,1) model.
The key insight is that the measurement noise (zₜ) from the Kalman Filter's observation equation is now being modeled by GARCH!

The main difference this makes is in the one-step forecast for the observation's variance:

Qₜ = F Rₜ F′ + sₜ²

Instead of a constant noise variance V, we now have sₜ², the time-varying variance from the GARCH model.

To estimate all the unknown parameters in this combined model, a special technique called Forward Filtering and Backward
Sampling (FFBS) is often used.
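
To see where the GARCH piece enters, here is a sketch of the filtering side only (made-up numbers; the full FFBS estimation is beyond this illustration). The only change from the plain Kalman recursion above is that the observation variance Qₜ now uses the time-varying sₜ² instead of a fixed V; the residual feeding the GARCH step is a rough proxy, since the exact zₜ depends on the unobserved state.

PYTHON
import numpy as np

G, F = 1.0, 1.0
W = 0.1
alpha0, alpha1, beta1 = 0.05, 0.1, 0.85   # GARCH(1,1) parameters (illustrative)

m, C = 0.0, 1.0
s2 = alpha0 / (1 - alpha1 - beta1)        # start the GARCH variance at its long-run level
z_prev = 0.0                              # previous observation residual z_{t-1}
observations = [1.2, 0.9, 1.5, 1.1]

for y in observations:
    s2 = alpha0 + alpha1 * z_prev**2 + beta1 * s2   # GARCH step for the noise variance
    a = G * m
    R = G * C * G + W
    f = F * a
    Q = F * R * F + s2            # <-- s_t^2 replaces the constant V here
    e = y - f
    K = R * F / Q
    m = a + K * e
    C = R - K * F * R
    z_prev = y - F * m            # rough proxy for z_t = y_t - F*x_t (FFBS would use a sampled state)
    print(round(m, 3), round(s2, 3))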

4. Shiny Application For GARCH(1,1) Model 💻


The coursework mentions a Shiny application. A "Shiny app" is an interactive web page built with the R programming language
that lets you play with data and models without writing any code.

Based on the description, this app would let you:

1. Choose an asset: Pick from Google stock returns, EUR/USD exchange rates, or Treasury bond yields.
2. Choose an error distribution: Select normal, Student's t, etc., for your GARCH model.
3. View Results:
See plots of the asset's returns (time series, histogram, QQ plot).
See plots of the squared returns to check for volatility clustering (ACF and PACF plots).
Get the final GARCH(1,1) model results and diagnostic checks.
5. Conclusion ✅
In this lesson, we journeyed through the powerful world of State Space Models. We learned the basics, saw their properties
and benefits, and dove into the famous Kalman Filter. We then successfully combined the Kalman Filter with a GARCH(1,1)
model to handle changing volatility in a sophisticated way.
