Fe Module 5
Required Reading
Hey there! 👋 Welcome to your guide on the ARCH model. It might sound complicated, but don't worry! We're going to break it
down with simple examples and analogies to make it super clear. We'll explore why stock prices can be so wild and
unpredictable, and how we can model their crazy swings. Let's get started! 🚀
ARCH MODEL
Get ready to dive into the world of financial time series! We're going to learn about a cool tool called the ARCH model that
helps us understand and predict the ups and downs (volatility) of things like stock prices.
1.1 Asset Return Time Series Data is More Stable Than Asset Price Time Series Data
Imagine you have a super fast-growing plant. 🌱 If you just plot its height every day, you'll see a line that mostly goes up, up,
up! This is like a stock's price.
Now, what if you instead plotted how much it grew each day as a percentage? Some days it might grow 2%, some days 1%,
and some days maybe even shrink a tiny bit. This percentage change is the return. You'd notice this plot bounces around an
average level (like 1.5% growth per day), making it much more stable and predictable than the ever-increasing height.
a. Easy Comparisons: A return is a percentage. This means you can easily compare the performance of a $10 stock with a
$1000 stock. A 5% return is a 5% return, no matter the price! It puts everything on the same scale.
b. Stability: As we saw with the plant analogy, asset prices often have a trend (they go up or down over time), while
returns tend to hover around a stable average (often close to zero for daily returns). This stability makes them much easier
to model.
Google's Stock Price (2016-2021): This chart looks like a mountain climbing steadily upwards. It has a clear trend.
Google's Stock Return (2016-2021): This chart looks like a fuzzy line bouncing wildly around the zero line. It doesn't
have a trend; it's stable on average, which is great for our models!
c. A Cool Math Trick: For small changes, the return can be calculated easily using logarithms. The formula looks like this:
$r_t = \dfrac{p_t - p_{t-1}}{p_{t-1}} \approx \log(p_t) - \log(p_{t-1})$
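Here's a quick sketch of how you might compute both kinds of return with pandas (the toy price series is made up for illustration):

PYTHON
import numpy as np
import pandas as pd

# A made-up series of daily closing prices
prices = pd.Series([100.0, 102.0, 101.0, 103.5, 103.0])

simple_returns = prices.pct_change()   # (p_t - p_{t-1}) / p_{t-1}
log_returns = np.log(prices).diff()    # log(p_t) - log(p_{t-1})

# For small daily moves the two are nearly identical
print(pd.DataFrame({"simple": simple_returns, "log": log_returns}))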
🧠 Quiz Question: Why is it better to use a stock's daily return for analysis instead of its daily price?
Answer: Because returns are more stable and hover around an average, unlike prices which tend to trend up or down. This
makes returns easier to compare and model.
Explanation: Think of it like comparing your test scores. Saying you improved by 10 points is useful, but saying you improved
by 10% (a return) is even better because it gives context, whether you went from 50 to 60 or 80 to 90. Plus, returns don't just
go up forever; they bounce around an average, making them predictable in a different way.
1.2 The Volatility of Asset Returns Can Vary During Different Time Periods
Volatility is just a fancy word for how much something swings up and down. High volatility means big swings (risky!), and low
volatility means small swings (safer).
If you look at the chart of Google's stock returns, you'll see something fascinating. It's not messy in the same way all the time.
There are calm periods with tiny wiggles, and then suddenly there are stormy periods with huge, wild swings.
This is called volatility clustering. 🌪️ It's like weather: you get a few calm, sunny days in a row, and then a week of
thunderstorms. The volatility isn't random; high-volatility days tend to be followed by more high-volatility days, and calm days
are followed by more calm days. This tells us that the past volatility has an impact on today's volatility.
Explanation: It's like a bag of popcorn. You'll hear a few pops, then a whole bunch of them popping like crazy, and then it
calms down again. The "crazy popping" periods are clustered together.
1.3 Asset Return Distribution Has Heavier Tails Than Normal Distribution
Imagine we plot every daily return of Google stock on a histogram. Most days, the return is very close to zero. But you'll also
see a surprising number of days where the stock had a huge gain or a massive loss.
This gives the distribution "heavy tails" or "fat tails." It means extreme events (big jumps or drops) happen more often than
you'd expect in a perfect "bell curve" (a Normal Distribution).
The coursework shows a histogram and a "Normal QQ Plot" for Google's returns.
Histogram: It looks mostly like a bell curve, but the bars at the far left (big losses) and far right (big gains) are taller than a
perfect bell curve would predict.
Normal QQ Plot: This plot compares Google's returns to a perfect normal distribution. If the returns were perfectly normal,
the dots would all fall on a straight red line. But we see the dots curve away from the line at the ends, which is the classic
sign of fat tails!
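If you'd like to reproduce this kind of check yourself, here's a minimal sketch using scipy and matplotlib (the fat-tailed t(3) sample stands in for real returns):

PYTHON
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

returns = np.random.default_rng(0).standard_t(df=3, size=1000)  # stand-in for real returns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(returns, bins=50, density=True)        # histogram: taller far-left/right bars
ax1.set_title("Histogram of returns")
stats.probplot(returns, dist="norm", plot=ax2)  # Normal QQ plot: dots curve off the line
ax2.set_title("Normal QQ plot")
plt.tight_layout()
plt.show()

print("Excess kurtosis:", stats.kurtosis(returns))  # > 0 signals fat tails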
🧠 Quiz Question: What does it mean for a stock's return distribution to have "heavy tails"?
Answer: It means that extreme outcomes (very large gains or very large losses) are more common than a normal bell curve
would suggest.
Explanation: Think about grades in a class. A normal distribution would have most people getting a B or C. A heavy-tailed
distribution would have a surprising number of students getting an A+ or an F.
Conditional Mean: This is the expected average of something, given what we already know.
Analogy: What's the average height of a person? That's a regular mean. What's the average height of a person given
that they are a professional basketball player? That's a conditional mean, and it will be much higher!
Conditional Variance: This is the expected spread or volatility of something, given what we already know.
Analogy: What's the typical range of temperatures in a year? That's regular variance. What's the range of
temperatures given that it's July? That's a conditional variance, and it will be much smaller and hotter than the
yearly average.
The formulas are just a mathematical way of saying this:
$\mu_{Y|X} = E(Y|X)$

In our stock return models, we use past information (like yesterday's return, $X_{t-1}$) to predict today's return ($X_t$).
The ARMA models you learned about predict the conditional mean. They say:

$X_t = \mu_t + a_t$, where $a_t$ is white noise with a constant variance.

But we saw that volatility changes! So, we need to model the conditional variance too. We can update our model to look like this:

$X_t = \mu_t + a_t$, with $a_t = \sigma_t e_t$, where the variance $\sigma_t^2$ changes over time.

The key idea is that today's volatility isn't constant; it's a function that depends on past values! This is exactly what ARCH and GARCH models do. For the rest of this lesson, we'll assume the conditional mean of asset returns is zero (they just bounce around 0) and focus only on the exciting part: the changing variance.
🧠 Quiz Question: In simple terms, what's the difference between a conditional mean and a conditional variance?
Answer: A conditional mean is a predicted average based on some information (e.g., expected temperature in summer). A
conditional variance is the predicted spread or risk based on some information (e.g., expected temperature range in summer).
3. ARCH(1) Model
Let's meet our first volatility model! ARCH stands for AutoRegressive Conditional Heteroskedasticity. Whoa, what a mouthful!
Let's break it down:
AutoRegressive (AR): Means it uses past values of itself to predict the future. Just like an AR model for returns.
Conditional (C): The prediction depends on what happened recently.
Heteroskedasticity (H): A fancy word for changing or non-constant variance. (Homo-skedasticity means constant
variance).
So, an ARCH model is one where the changing variance depends on past values.
The simplest version is the ARCH(1) model, which only looks back one day. It has two simple equations. The first one is:

$r_t = \sigma_t e_t$

$r_t$: Today's return.
$\sigma_t$: Today's volatility (standard deviation).
$e_t$: A random shock, like a dice roll. It's usually a standard normal variable (mean 0, variance 1).

This equation says that today's return is just a random shock scaled up or down by today's volatility. If volatility ($\sigma_t$) is high, the return can be huge (positive or negative). If it's low, the return will be small.
The second equation is:

$\sigma_t^2 = \alpha_0 + \alpha_1 r_{t-1}^2$

$\sigma_t^2$: Today's variance (volatility squared).
$\alpha_0$: A constant, baseline level of variance. The "normal" amount of chaos.
$\alpha_1$: A number between 0 and 1 that says how much yesterday's shock affects today's variance.
$r_{t-1}^2$: Yesterday's return, squared. We square it because we care about the size of the move (e.g., -2% and +2% have the same magnitude), not the direction.

This is the magic! It says today's volatility is determined by a baseline level ($\alpha_0$) plus a fraction ($\alpha_1$) of how big yesterday's move was ($r_{t-1}^2$). If yesterday had a huge return (a big shock), today's variance will be higher! This is how the model creates volatility clustering.
Diagram: yesterday's return $r_{t-1}$ is squared to give $r_{t-1}^2$, which feeds into today's variance.
This just means the process doesn't explode to infinity. As long as $\alpha_1$ is less than 1, the model will always stay under control.
3.2 $r_t$ is Conditionally Normally Distributed

If we know what yesterday's return ($r_{t-1}$) was, then we know today's variance ($\sigma_t^2$). This means today's return will follow a normal distribution (a bell curve) with a mean of 0 and that specific variance:

$r_t | r_{t-1} \sim N(0,\ \alpha_0 + \alpha_1 r_{t-1}^2)$
This is a bit tricky but super important. If you look at the returns unconditionally (without knowing the past), they look totally
random!
So, the series is conditionally heteroskedastic (variance changes based on yesterday) but unconditionally
homoscedastic (the long-run average variance is constant).
Even though the returns are uncorrelated, they are not independent. The volatility of today's return clearly depends on the
size of yesterday's return. This is a great example that "uncorrelated" does not automatically mean "independent"!
This is the big clue for detectives! 🕵️ While the returns ($r_t$) look like random noise, the squared returns ($r_t^2$) behave just like an AR(1) process:

$r_t^2 = \alpha_0 + \alpha_1 r_{t-1}^2 + \nu_t$
This means we can use our old tools (ACF and PACF plots) on the squared returns to see if an ARCH(1) model is a good fit. If
the PACF of the squared returns cuts off after lag 1, we've found our suspect!
The ARCH model naturally produces more extreme outcomes (fat tails), just like we see in real financial data. The periods of
high volatility allow for huge price swings to happen.
Why only look back one day? An ARCH(m) model looks back m days. The variance equation just gets a bit longer:
$\sigma_t^2 = \alpha_0 + \alpha_1 r_{t-1}^2 + \alpha_2 r_{t-2}^2 + \cdots + \alpha_m r_{t-m}^2$
This lets today's volatility be influenced by the events of the last few days, not just yesterday.
🧠 Quiz Question: In an ARCH(1) model, if a stock has a huge price drop yesterday, what is likely to happen to its volatility
today?
Answer: Its volatility will increase. Yesterday's squared return, $r_{t-1}^2$, will be a large positive number. This makes today's variance, $\sigma_t^2$, bigger, leading to higher volatility.
4. ARCH(1) Simulation
Let's see an ARCH(1) process in action! The coursework created a fake time series using the ARCH(1) rules with $\alpha_0 = 5$ and $\alpha_1 = 0.5$.
The Simulation Plot: The chart shows a series that mostly wiggles around zero. But then, there are clear "clusters" where
it goes wild with big spikes up and down (like around time=600). This looks just like the real Google returns data! It
successfully created volatility clustering.
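Here's a minimal sketch of how such a series can be simulated with numpy, using the same $\alpha_0 = 5$ and $\alpha_1 = 0.5$:

PYTHON
import numpy as np

rng = np.random.default_rng(42)
alpha0, alpha1 = 5.0, 0.5
n = 1000

r = np.zeros(n)
for t in range(1, n):
    sigma2 = alpha0 + alpha1 * r[t - 1] ** 2        # today's variance from yesterday's squared return
    r[t] = np.sqrt(sigma2) * rng.standard_normal()  # r_t = sigma_t * e_t

# Plotting r shows calm stretches punctuated by clusters of wild swings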
Now, let's play detective with the ACF and PACF plots.
ACF and PACF of the Returns: The plots show no significant bars sticking out of the blue shaded area. This means the
returns are uncorrelated and look like random white noise. If we stopped here, we'd miss the whole story!
ACF and PACF of the SQUARED Returns: This is where we find the evidence!
The ACF plot shows bars that slowly decay, like a ramp going down.
The PACF plot shows one big, significant bar at lag 1, and then all the other bars are not significant (they are inside
the blue area).
This "decaying ACF and PACF cut-off at lag 1" pattern is the classic fingerprint of an AR(1) process. Since we found it in the
squared returns, we can be confident that an ARCH(1) model is a great way to describe this data.
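One way to run this detective work yourself, reusing the simulated series r from the sketch above (statsmodels provides the plotting helpers):

PYTHON
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
plot_acf(r, lags=20, ax=axes[0, 0], title="ACF of returns")              # looks like white noise
plot_pacf(r, lags=20, ax=axes[0, 1], title="PACF of returns")
plot_acf(r**2, lags=20, ax=axes[1, 0], title="ACF of squared returns")   # slow decay
plot_pacf(r**2, lags=20, ax=axes[1, 1], title="PACF of squared returns") # cuts off at lag 1
plt.tight_layout()
plt.show()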
5. Conclusion
Awesome job! 🎉 You've learned about the key features of financial returns, like volatility clustering and heavy tails. You've
also been introduced to your first volatility model, ARCH(1), which can capture these features beautifully. The main trick is to
remember that while returns might look random, their squared values can tell a hidden story.
In the next lesson, we'll build on this by learning about the GARCH model, which is like ARCH's even more popular and
powerful cousin. See you there! 🌟
Main Lesson 1
Hey there! 👋 We are going to walk through this Volatility Modeling coursework step by step, covering every detail, formula, and concept presented, and explaining it all in a clear and engaging way.
Think of volatility as a measure of uncertainty or risk in finance. It’s the "wobble" in a stock's price. Let's learn how the pros
measure, predict, and model this wobble. 🎢
Step 1: Calculate the Returns 📈
First, compute the log returns from the price series, and let $\bar{R}$ be their sample average:

$R_t = \log(P_t / P_{t-1})$
Step 2: Calculate the Standard Deviation (σ) 📊
The standard deviation measures how much the returns are spread out from their average. This is the raw, periodic volatility. The theoretical formula is:

$\sigma = \sqrt{E[(R_t - \mu)^2]}$

When we use our actual data, we calculate the sample estimate like this:

$\hat{\sigma} = \sqrt{\dfrac{1}{T-1}\sum_{t=1}^{T}(R_t - \bar{R})^2}$

Step 3: Annualize the Volatility 📅
To compare the volatility of different assets, we convert the periodic volatility (like daily) into a yearly figure. You do this by multiplying by the square root of the number of periods in a year:

$\widehat{\mathrm{vol}} = \begin{cases} \sqrt{252}\,\hat{\sigma} & \text{(for daily prices)} \\ \sqrt{52}\,\hat{\sigma} & \text{(for weekly prices)} \\ \sqrt{12}\,\hat{\sigma} & \text{(for monthly prices)} \end{cases}$

Forecasting Volatility
Historical Average: Predicts that future variance will be the average of all variances up to the current point. This uses all data but can be slow to react to new market conditions.

$\tilde{\sigma}_{t+1}^2 = \dfrac{1}{t}\sum_{j=1}^{t}\hat{\sigma}_j^2$

Simple Moving Average: Uses the average of only the last m periods. It's more responsive because it drops old data.

$\tilde{\sigma}_{t+1}^2 = \dfrac{1}{m}\sum_{j=0}^{m-1}\hat{\sigma}_{t-j}^2$

Exponential Moving Average: A smart blend that uses all available data but gives more weight to recent events. The parameter β (between 0 and 1) controls how quickly the influence of old data fades.

$\tilde{\sigma}_{t+1}^2 = (1-\beta)\hat{\sigma}_t^2 + \beta\tilde{\sigma}_t^2$

Simple Regression: Uses a regression model to predict tomorrow's variance based on the last p periods of variance. This can be fit on all data or on a "rolling window" of the most recent data.

$\tilde{\sigma}_{t+1}^2 = \gamma_{1,t}\hat{\sigma}_t^2 + \gamma_{2,t}\hat{\sigma}_{t-1}^2 + \cdots + \gamma_{p,t}\hat{\sigma}_{t-p+1}^2 + u_t$
The Key Trade-Off: You must balance using lots of data for a precise estimate versus using recent data that is more relevant for today's market. After making forecasts, you evaluate them using metrics like MSE (Mean Squared Error) and MAE (Mean Absolute Error).
🧠 Quiz Question: What is the main trade-off when deciding how much historical data to use for a volatility forecast?
Answer: It's a trade-off between precision and relevance.
Explanation: Using more data (a longer history) can give you a more statistically stable and precise estimate. However, using
only recent data ensures your forecast is more relevant to the current market environment, which might be very different from
the market 10 years ago.
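As a concrete illustration, here's a small sketch of the sample estimate, the annualization, and an EMA variance forecast (the β = 0.9 and the use of squared returns as the daily variance proxy are illustrative assumptions):

PYTHON
import numpy as np

returns = np.random.default_rng(0).normal(0, 0.01, 500)  # stand-in daily log returns

# Sample standard deviation (periodic volatility), then annualized
sigma_hat = returns.std(ddof=1)          # uses the 1/(T-1) normalization
annual_vol = np.sqrt(252) * sigma_hat    # daily -> yearly

# Exponential moving average forecast of variance; beta controls how fast old data fades
beta = 0.9
var_forecast = returns[0] ** 2
for r in returns[1:]:
    var_forecast = (1 - beta) * r ** 2 + beta * var_forecast

print(annual_vol, np.sqrt(var_forecast))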
2. Geometric Brownian Motion (GBM)
GBM is a foundational model for asset prices that assumes they move randomly but with a general trend. Think of it like a
remote-controlled car with a wonky wheel: it drives in a general direction, but it's constantly wobbling randomly.
Under GBM, the log-returns of the asset are normally distributed. The parameters μ and σ can be estimated from data using
Maximum-Likelihood Estimation. For periods of equal length (Δ j ), the estimates are:
$\hat{\mu}^* = \bar{R}, \qquad \hat{\sigma}^2 = \dfrac{1}{n}\sum_{t=1}^{n}(R_t - \bar{R})^2$
3. Volatility Estimators Using High, Low, Open, and Close Prices
Think of it like this: if you want to know how much the temperature changed in a day, just knowing the temperature at 5 PM (the close) isn't the full story. Knowing the day's high and low gives you a much better picture of the temperature's volatility.
The coursework presents a series of increasingly sophisticated (and efficient) estimators. "Efficiency" here means how good
the estimate is. An efficiency of 7.4 means the estimator is 7.4 times better (less noisy) than the basic close-to-close
estimator!
Close-to-Close: $\hat{\sigma}_0^2 = (C_1 - C_0)^2$. The simplest method, with a base efficiency of 1.

Parkinson (1976): Uses the high and low prices, achieving an efficiency of about 5.2.

$\hat{\sigma}_3^2 = \dfrac{(H_1 - L_1)^2}{4\log 2}$

Garman and Klass (1980): Created a "Best Analytic Scale-Invariant Estimator" that combines the high, low, and close data relative to the open price. This estimator achieves an efficiency of about 7.4.

$\hat{\sigma}^{2\,**} = 0.511(u-d)^2 - 0.019\{c(u+d) - 2ud\} - 0.383c^2$

The Composite GK Estimator: This is the final version that also includes the overnight jump from the previous day's close ($C_0$) to the current day's open ($O_1$). This powerful estimator reaches a final efficiency of about 8.4.

$\hat{\sigma}_{GK}^2 = a\,\dfrac{(O_1 - C_0)^2}{f} + (1-a)\,\dfrac{\sigma^{2\,**}}{1-f}$
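As an illustration, the Parkinson estimator is easy to compute from daily highs and lows (here the $H_1$ and $L_1$ in the formula are taken as log prices, so $H_1 - L_1$ becomes log(high/low); the sample numbers are made up):

PYTHON
import numpy as np

high = np.array([103.0, 104.5, 102.8])   # hypothetical daily highs
low = np.array([99.5, 101.0, 99.9])      # hypothetical daily lows

# Parkinson (1976): per-day variance estimate (log(H/L))^2 / (4 log 2), averaged
park_var = (np.log(high / low) ** 2 / (4 * np.log(2))).mean()
print(np.sqrt(park_var))                 # daily volatility estimate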
To capture sudden market moves, the GBM equation can be extended with a jump component:

$\dfrac{dS(t)}{S(t)} = \mu\,dt + \sigma\,dW(t) + \gamma\sigma Z(t)\,d\Pi(t)$

The new piece is the jump term, $d\Pi(t)$, which is an increment from a Poisson Process. This process models the occurrence of a jump, while the $Z(t)$ term models the magnitude of that jump. This allows the model to have both small, continuous changes and large, sudden ones.
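A rough simulation sketch of such a jump-diffusion path (all parameter values here are illustrative, not from the coursework):

PYTHON
import numpy as np

rng = np.random.default_rng(1)
n, dt = 1000, 1 / 252
mu, sigma = 0.08, 0.20        # drift and diffusion volatility
lam, gamma = 5.0, 3.0         # jump intensity (per year) and jump scale

S = np.empty(n + 1)
S[0] = 100.0
for t in range(n):
    dW = np.sqrt(dt) * rng.standard_normal()   # Brownian increment
    dPi = rng.poisson(lam * dt)                # Poisson jump count for this step
    Z = rng.standard_normal()                  # jump magnitude
    S[t + 1] = S[t] * (1 + mu * dt + sigma * dW + gamma * sigma * Z * dPi)

# S now shows a wobbly trend with occasional sudden jumps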
4. ARCH Models
The Autoregressive Conditional Heteroskedasticity (ARCH) model was a game-changer because it was designed for a
world where volatility is not constant.
The core idea is that today's volatility depends on the size of yesterday's surprises (or shocks). The ARCH(p) model for the return $y_t$ is:

$y_t = \mu_t + \epsilon_t, \qquad \sigma_t^2 = \alpha_0 + \alpha_1\epsilon_{t-1}^2 + \cdots + \alpha_p\epsilon_{t-p}^2$

This says that today's variance, $\sigma_t^2$, is a weighted average of past squared shocks ($\epsilon^2$). This means the squared shocks follow an AR(p) process.
5. GARCH Models
The Generalized ARCH (GARCH) model by Bollerslev (1986) is an extension of ARCH and is hugely popular in finance.
GARCH says today's variance depends on yesterday's shocks AND yesterday's variance. It has "memory." The widely used
GARCH(1,1) model is:
$\sigma_t^2 = \alpha_0 + \alpha_1\epsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2$

$\alpha_1\epsilon_{t-1}^2$ is the ARCH term (reaction to news).
$\beta_1\sigma_{t-1}^2$ is the GARCH term (persistence of volatility).

This structure implies that the squared shocks, $\epsilon_t^2$, follow an ARMA(1,1) process. For the GARCH model to be stable, we need $\alpha_1 + \beta_1 < 1$. GARCH reproduces three key features of real returns:
Volatility Clustering: A large shock or high variance yesterday leads to high variance today, creating clusters of high and
low volatility.
Heavy Tails: GARCH models naturally produce distributions with more extreme outcomes (higher kurtosis) than a simple
Gaussian distribution, which is true for real returns.
Mean Reversion: Volatility doesn't explode to infinity. It always tends to revert to its long-run average variance, which for
GARCH(1,1) is:
$\sigma_*^2 = \dfrac{\alpha_0}{1 - \alpha_1 - \beta_1}$
🧠 Quiz Question: What is the key difference in the variance equation between an ARCH(1) model and a GARCH(1,1) model?
Answer: The GARCH(1,1) model adds a term for the previous period's variance, $\beta_1\sigma_{t-1}^2$.
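To see all three properties at once, here's a minimal GARCH(1,1) simulation sketch (parameters are illustrative and chosen so that $\alpha_1 + \beta_1 < 1$):

PYTHON
import numpy as np

rng = np.random.default_rng(7)
alpha0, alpha1, beta1 = 0.1, 0.1, 0.85   # alpha1 + beta1 < 1 => stable
n = 2000

eps = np.zeros(n)
sigma2 = np.full(n, alpha0 / (1 - alpha1 - beta1))  # start at the long-run variance
for t in range(1, n):
    sigma2[t] = alpha0 + alpha1 * eps[t - 1] ** 2 + beta1 * sigma2[t - 1]
    eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

# Sample kurtosis above 3 shows the heavy tails GARCH produces
print((eps ** 4).mean() / (eps ** 2).mean() ** 2)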
LESSON 2
Welcome to your first class on Bayesian Updating! 🎉 This might sound super advanced, but it's actually a really cool and
logical way to update our beliefs when we get new information. Think of yourself as a detective. You start with a guess (a
prior belief), then you find a clue (the data), and that clue helps you update your guess to be more accurate (a posterior
belief). Let's dive in!
1. Use Bayes' Theorem: We'll learn how to use a special formula to figure out probabilities like a pro.
2. Know the Lingo: We'll get familiar with key terms like prior, likelihood, posterior, data, and hypothesis. You'll see how
each one plays a role in our detective work.
3. Use Update Tables: We'll learn how to organize all our work in a simple table that makes complex problems way easier to
solve.
$P(H|D) = \dfrac{P(D|H)\,P(H)}{P(D)}$

$P(H|D)$: The probability that our hypothesis is true, given the data we saw. This is what we usually want to find! It's our posterior belief.
Quiz Question: Imagine your friend loves pizza. The probability that your friend is happy if they eat pizza is high. Does this
mean the probability they ate pizza if they are happy is also high?
Answer: Not necessarily!
Explanation: They could be happy for lots of other reasons, like acing a test or getting a new video game. This is the core
idea of Bayes' theorem – it helps us figure out the real probability instead of just jumping to conclusions!
I pick one coin at random, flip it, and it lands on HEADS. What's the probability it was a Type A, Type B, or Type C coin?
Diagram: a probability tree starting from the coin draw and branching into the heads/tails outcomes for each coin type.
To use Bayes' formula, we first need the total probability of getting heads, P (D). We find this by adding up the chances of
getting heads from each coin type:
P (D) = (0.5 × 0.4) + (0.6 × 0.4) + (0.9 × 0.2) = 0.20 + 0.24 + 0.18 = 0.62
Now we can find our posterior probabilities!
$P(A|D) = \dfrac{0.5 \times 0.4}{0.62} = \dfrac{0.20}{0.62} \approx 0.3226$

$P(B|D) = \dfrac{0.6 \times 0.4}{0.62} = \dfrac{0.24}{0.62} \approx 0.3871$

$P(C|D) = \dfrac{0.9 \times 0.2}{0.62} = \dfrac{0.18}{0.62} \approx 0.2903$
Instead of doing all that math separately, let's use a Bayesian Update Table. It makes life so much easier!
Hypothesis (H) | Prior P(H) | Likelihood P(D|H) | Bayes Numerator P(D|H)P(H) | Posterior P(H|D)
A (θ = 0.5) | 0.4 | 0.5 | 0.20 | 0.3226
B (θ = 0.6) | 0.4 | 0.6 | 0.24 | 0.3871
C (θ = 0.9) | 0.2 | 0.9 | 0.18 | 0.2903
Total | 1 | | 0.62 | 1
This whole process of starting with a prior and using data to get a posterior is called Bayesian Updating. You just updated
your beliefs with evidence! 🧠
$P(\text{hypothesis}|\text{data}) = \dfrac{P(\text{data}|\text{hypothesis})\,P(\text{hypothesis})}{P(\text{data})}$
We can also write this as posterior ∝ likelihood × prior. The symbol ∝ just means "is proportional to." It tells us that the posterior is driven by the prior multiplied by the likelihood.
Let θ (theta) be the value for our hypothesis. For our coins, θ could be 0.5, 0.6, or 0.9.
Let x be our data. We'll say x = 1 for Heads and x = 0 for Tails.
The prior pmf p(θ) is just our table of prior probabilities.
The posterior pmf p(θ|x = 1) is our table of posterior probabilities after seeing one head.
Example 2: What if the flip was TAILS (x = 0)? Let's redo the table!
The probabilities of getting tails are one minus the heads probabilities: 0.5 for Type A, 0.4 for Type B, and just 0.1 for Type C.

Hypothesis (θ) | Prior p(θ) | Likelihood p(x = 0|θ) | Bayes Numerator p(x = 0|θ)p(θ) | Posterior p(θ|x = 0)
0.5 | 0.4 | 0.5 | 0.20 | 0.5263
0.6 | 0.4 | 0.4 | 0.16 | 0.4211
0.9 | 0.2 | 0.1 | 0.02 | 0.0526
Total | 1 | | 0.38 | 1
Wow! After getting tails, the fair coin (Type A) is now the most likely. The super-bent coin that loves heads (Type C) is now
extremely unlikely. Our beliefs updated correctly based on the new clue!
Quiz Question: In the "Tails" example, why did the posterior probability for Type C (θ = 0.9) drop so much?
Answer: Because Type C coins are really good at getting heads (90% chance).
Explanation: Seeing a "Tails" is very surprising if the coin is Type C. This surprising data provides strong evidence against the
hypothesis that the coin was Type C, so its probability plummets.
Example 3: Let's go back to our first example. We flipped the coin and got Heads. Now, we flip the same coin a second time
and get Heads again. What are the probabilities now?
We can just add more columns to our table! The "Bayes Numerator 1" is the result from our first flip. We multiply that by the
likelihood of getting a second head to get "Bayes Numerator 2".
Hypothesis (θ) | Prior p(θ) | Likelihood 1 | Bayes Numerator 1 | Likelihood 2 | Bayes Numerator 2 | Posterior
0.5 | 0.4 | 0.5 | 0.20 | 0.5 | 0.20 × 0.5 = 0.100 | 0.100/0.406 ≈ 0.2463
0.6 | 0.4 | 0.6 | 0.24 | 0.6 | 0.24 × 0.6 = 0.144 | 0.144/0.406 ≈ 0.3547
0.9 | 0.2 | 0.9 | 0.18 | 0.9 | 0.18 × 0.9 = 0.162 | 0.162/0.406 ≈ 0.3990
Total | 1 | | | | 0.406 | 1
After two heads in a row, the Type C coin has finally become the most likely! The repeated evidence for 'Heads' was so strong
it overcame Type C's low prior probability. Each new piece of data helps us zero in on the most likely hypothesis.
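Here's a tiny sketch that reproduces the two updates numerically (priors and likelihoods taken from the example above):

PYTHON
import numpy as np

priors = np.array([0.4, 0.4, 0.2])    # Type A, B, C
p_heads = np.array([0.5, 0.6, 0.9])   # likelihood of heads for each type

def update(prior, likelihood):
    numerator = likelihood * prior      # Bayes numerator
    return numerator / numerator.sum()  # normalize to get the posterior

post = update(priors, p_heads)        # after the first heads
post = update(post, p_heads)          # after the second heads
print(post.round(4))                  # ~ [0.2463, 0.3547, 0.3990]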
The disease is rare: only 0.5% of the population has it. If a random person tests positive, what is the probability they actually
have the disease?
The result is shocking! Even with a positive test from a 99% accurate test, the chance you actually have the disease is only
about 20%!
Why? This is the base rate fallacy. The "base rate" (the prior probability of having the disease) is incredibly low (0.5%). There
are so many more healthy people than sick people that the small percentage of "false positives" from the healthy group
creates a bigger pool of positive tests than the "true positives" from the sick group.
Welcome to the fascinating world of probabilistic prediction! 🎉 This is all about making smart, number-based guesses about
the future, from predicting the weather to the outcome of a video game. Let's learn how to update our predictions as we get
new information!
2. Introduction 🚀
In our last lesson, we saw how to update our beliefs about something (like which type of coin we're holding) based on new
evidence. Now, we're taking it a step further: we'll use that new evidence to predict what might happen next!
Using precise probabilities helps us understand the world better. It's used everywhere:
3. Predictive Probabilities
Probabilistic prediction is all about assigning a specific probability to every possible outcome. Let's use a fun example from
our last class to see how it works.
2 Type A coins: Perfectly fair, with a 50% chance of heads ($P(\text{Heads}|A) = 0.5$).
1 Type B coin: Slightly biased, with a 60% chance of heads ($P(\text{Heads}|B) = 0.6$).
1 Type C coin: Very biased, with a 90% chance of heads ($P(\text{Heads}|C) = 0.9$).
You reach in and pick one coin at random. What's the chance you grabbed each type?
$P(A) = \dfrac{2 \text{ Type A coins}}{4 \text{ total coins}} = 0.5, \qquad P(B) = \dfrac{1 \text{ Type B coin}}{4 \text{ total coins}} = 0.25, \qquad P(C) = \dfrac{1 \text{ Type C coin}}{4 \text{ total coins}} = 0.25$
These are our prior probabilities—our beliefs before we've flipped the coin.
Think of it like this: your overall chance of getting heads is a weighted average of the chances for each coin type. The "weight" is the probability of having picked that coin type in the first place:

$P(\text{Heads}) = 0.5 \times 0.5 + 0.6 \times 0.25 + 0.9 \times 0.25 = 0.625$
Diagram: a tree from "Pick a Coin" branching to Heads ($P(H|A) = 0.5$, $P(H|B) = 0.6$, $P(H|C) = 0.9$) and Tails ($P(T|A) = 0.5$, $P(T|B) = 0.4$, $P(T|C) = 0.1$).
Hypothesis (Coin Type) | Prior P(H) | Likelihood P(D|H) | Bayes Numerator P(D|H)P(H) | Posterior P(H|D)
A | 0.50 | 0.5 | 0.250 | 0.40
B | 0.25 | 0.6 | 0.150 | 0.24
C | 0.25 | 0.9 | 0.225 | 0.36
Total | 1 | | 0.625 | 1
Now, we want to predict the outcome of a second flip. This is the posterior predictive probability. We do the exact same
calculation as before, but we use our new posterior probabilities as the weights!
Let's predict heads on the second flip, given we got heads on the first ($D_{H_2}|D_{H_1}$):

$P(D_{H_2}|D_{H_1}) = 0.5 \times 0.40 + 0.6 \times 0.24 + 0.9 \times 0.36 = 0.668$
Our prediction for heads on the next toss has increased from 62.5% to 66.8%! This is because the first heads made us
suspect we have a coin that's good at landing heads (like Type C), so we adjust our future predictions accordingly.
Quiz Question: Why did the probability of getting heads on the second toss increase after we saw heads on the first toss?
a) Because the coin physically changed.
b) Because the first result made us more confident, which affects the coin.
c) Because the first result updated our belief, making us think it's more likely we have a coin that favors heads.
Answer: c) Because the first result updated our belief, making us think it's more likely we have a coin that favors heads.
Explanation: The coin itself doesn't change, but our knowledge about it does! The data (getting heads) points towards the
coin being type B or C, which have a higher probability of landing heads.
3.3 Review 📝
Let's break down the key terms to avoid confusion.
Essentially, you're using your belief about the hypothesis (which coin is it?) to predict the outcome (will it be heads?). As you
get data, your belief about the hypothesis changes, and so does your prediction for the next outcome!
Welcome back! 🧠 Today, we're taking a giant leap in our ability to make predictions. Before, we dealt with a handful of
choices, like guessing if a die has 4, 6, or 8 sides. Now, we're going to learn how to handle a smooth, continuous range of
possibilities, like a bent coin where the chance of heads could be any number between 0 and 1!
1. Understanding a Continuous Range of Hypotheses: We'll see how a "family" of possibilities (like a million slightly
different bent coins) can be our set of hypotheses.
2. Using Bayes' Theorem for Continuous Ideas: You'll learn how to use the same powerful Bayes' theorem, but this time
with smooth graphs (PDFs) and integrals instead of simple sums.
3. Updating from a Prior to a Posterior PDF: We'll practice starting with an initial guess (a prior probability density function)
and using new evidence to create a refined, smarter guess (a posterior PDF).
4. Making Predictions with New Knowledge: You'll be able to use your updated guess to predict what's likely to happen
next!
2. Introduction 🚀
So far, our Bayesian updating has been for a limited, countable number of hypotheses. Now, we're upgrading to handle
situations with an infinite, continuous range of possibilities.
The best part? The core logic is exactly the same. We're just swapping out our tools:
Think of it like upgrading from a pixelated image to a high-resolution photo. The underlying picture is the same, but now we
can see all the smooth details in between!
4. Notational Conventions ✍️
To keep things neat, let's agree on some notation.
Big Letters (Events): This is the conceptual level. We talk about the probability P of an Event A , like P(Coin is Type A) .
Little Letters (Values): This is the calculation level. We use a pmf p(x) or pdf f(x) for a specific value x .
Concept | Event Notation (Big Letters) | Value Notation (Little Letters) | Meaning
Prior | P(H) | p(θ) or f(θ) | Our initial belief about the hypothesis.
Likelihood | P(D|H) | p(x|θ) or f(x|θ) | How likely the data x is under each hypothesis.
Total Prob. | P(D) | p(x) or f(x) | The overall chance of seeing the data x.
Posterior | P(H|D) | p(θ|x) or f(θ|x) | Our updated belief after seeing the data.
From now on, we'll mostly use the little letters. We'll use Greek letters ( θ , λ , μ , σ ) for hypotheses and English letters ( x , y ) for
data.
Analogy: Think of a map showing population density. The map tells you how crowded an area is (people per square mile), but
to find the actual number of people, you have to look at a specific area (an interval).
For a PDF, the probability that our variable X falls between c and d is the area under the curve from c to d .
$P(c \le X \le d) = \int_c^d f(x)\,dx$
The tiny slice of probability in an infinitesimally small range dx around a point x is f (x)dx. The integral is just our way of
summing up all these tiny slices!
$P(D) = \sum_i P(D|H_i)\,P(H_i)$
For a continuous range, we do the same thing but replace the sum with an integral!
We integrate the likelihood of heads ( θ ) times the prior for θ ( 2θ ) over all possible values of θ (from 0 to 1).
$p(x = 1) = \int_0^1 \theta \cdot 2\theta\,d\theta = \left[\dfrac{2\theta^3}{3}\right]_0^1 = \dfrac{2}{3} - 0 = \dfrac{2}{3}$
Our prior belief leaned towards heads-biased coins, so it makes sense our overall prediction for heads is high (2/3)!
Quiz Question: Why is the law of total probability for continuous variables an integral instead of a sum?
a) Integrals are more accurate than sums.
b) Because there are infinitely many hypotheses in a continuous range, and an integral is like an infinite sum over a continuous
range.
c) Because the data is continuous.
Answer: b) Because there are infinitely many hypotheses in a continuous range, and an integral is like an infinite sum over a
continuous range.
Explanation: An integral is the perfect tool from calculus for summing up contributions over a smooth, continuous interval.
$\text{Posterior} = \dfrac{\text{Likelihood} \times \text{Prior}}{\text{Total Probability}}, \qquad f(\theta|x) = \dfrac{p(x|\theta)\,f(\theta)}{p(x)}$
Suppose we flip the coin three times and see HHT. The likelihood is $p(x|\theta) = \theta \cdot \theta \cdot (1-\theta) = \theta^2(1-\theta)$. Let's build the continuous update table with the prior $f(\theta) = 2\theta$:

Hypothesis | Prior | Likelihood | Bayes Numerator | Posterior
θ in [0,1] | 2θ dθ | θ²(1−θ) | 2θ³(1−θ) dθ | 20θ³(1−θ) dθ
Total | 1 | | p(x) = ∫₀¹ 2θ³(1−θ) dθ = 1/10 | 1

Dividing the Bayes numerator by the total (1/10) gives the posterior PDF $f(\theta|x) = 20\theta^3(1-\theta)$.
Now start instead from a flat prior, $f(\theta) = 1$ on [0,1], and flip one heads (x = 1):

Hypothesis | Prior | Likelihood | Bayes Numerator | Posterior
θ in [0,1] | 1 · dθ | θ | θ dθ | 2θ dθ
Total | 1 | | p(x = 1) = ∫₀¹ θ dθ = 1/2 | 1
The posterior PDF is f (θ|x = 1) = 2θ. We started with a "don't know" belief, and after one heads, our belief shifted to favor
higher values of θ .
Before seeing any data, what was the probability the coin was biased towards heads (θ > 0.5)?

$P(\theta > 0.5) = \int_{0.5}^{1} f(\theta)\,d\theta = \int_{0.5}^{1} 1\,d\theta = [\theta]_{0.5}^{1} = 1 - 0.5 = \dfrac{1}{2}$

After seeing one heads, we use the posterior instead:

$P(\theta > 0.5\,|\,x = 1) = \int_{0.5}^{1} 2\theta\,d\theta = [\theta^2]_{0.5}^{1} = 1 - 0.25 = 0.75$

Our belief jumped from 50% to 75% after just one flip!
Prior predictive probability of heads (we found this in Ex 5): p(x 1 = 1) = 2/3 .
Now, suppose the first flip was heads ($x_1 = 1$). We need the posterior PDF. The coursework has a calculation error here; let's fix it.

Prior: $f(\theta) = 2\theta$
Likelihood: $p(x_1 = 1|\theta) = \theta$
Bayes Numerator: $2\theta \cdot \theta = 2\theta^2$
Total: $p(x_1 = 1) = \int_0^1 2\theta^2\,d\theta = \dfrac{2}{3}$
Posterior: $f(\theta|x_1 = 1) = \dfrac{2\theta^2}{2/3} = 3\theta^2$

Now, what's the posterior predictive probability of heads on the second flip? We use our new posterior PDF:

$p(x_2 = 1|x_1 = 1) = \int_0^1 \theta \cdot 3\theta^2\,d\theta = \int_0^1 3\theta^3\,d\theta = \left[\dfrac{3\theta^4}{4}\right]_0^1 = \dfrac{3}{4}$
Our prediction for heads increased from 2/3 (66.7%) to 3/4 (75%) after seeing one heads.
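Both predictive probabilities are easy to verify numerically, for example with scipy's quad:

PYTHON
from scipy.integrate import quad

prior = lambda th: 2 * th                                  # f(theta) = 2*theta
prior_pred, _ = quad(lambda th: th * prior(th), 0, 1)      # = 2/3

posterior = lambda th: 3 * th ** 2                         # f(theta | x1 = 1) = 3*theta^2
post_pred, _ = quad(lambda th: th * posterior(th), 0, 1)   # = 3/4

print(prior_pred, post_pred)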
Imagine we approximate a flat prior by saying θ can only be one of 4 values: 1/8, 3/8, 5/8, 7/8 . We give each a prior
probability of 1/4.
If we get one heads, we can make a standard discrete update table. The result is a bar chart for our posterior belief.
As we slice the continuous range into more and more tiny pieces, our blocky histograms for the prior and posterior get
smoother and smoother. Eventually, in the limit, they converge to the smooth PDF curves we've been calculating with
integrals! The discrete sum in the update table becomes the integral for the total probability.
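A small sketch of this limiting idea: on a midpoint grid, the discrete posterior (rescaled into a density) matches the continuous answer $f(\theta|x=1) = 2\theta$ from the flat-prior example:

PYTHON
import numpy as np

theta = np.array([1, 3, 5, 7]) / 8      # the 4-value grid from the example
prior = np.full(4, 0.25)                # flat prior
numer = theta * prior                   # likelihood of one heads times prior
print(numer / numer.sum())              # posterior bar chart: [0.0625 0.1875 0.3125 0.4375]

# Refine the grid: posterior probability / grid width -> the smooth density 2*theta
n = 1000
theta = (np.arange(n) + 0.5) / n
post = theta / theta.sum()
print(np.allclose(post * n, 2 * theta))  # True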
Required Reading
Hello! Let's dive into one of the biggest ideas in statistics: the Bayesian approach. Think of it as a different philosophy for how
to think about evidence and uncertainty. It's like being a detective with a powerful new way to solve cases! 🕵️♂️
1. A Parametric Model ($f(V_T|\theta)$): This is your theory about how the world works. It's a rule that says, "If the secret parameter θ has a certain value, here's the probability of seeing the data $V_T$."
Analogy: This is your "theory of the crime." For example, "If the suspect is left-handed (θ), this is the likely pattern of evidence ($f(V_T|\theta)$)."
2. A Prior Distribution (p(θ)): This is your belief, hunch, or suspicion about the secret parameter before you see any
evidence.
Analogy: This is your "initial list of suspects." You might have a prior belief that some suspects are more likely than
others based on past experience.
By combining these two pieces using Bayes' Rule, you get the Posterior Distribution ($p(\theta|V_T)$). This is the magic part! It represents your updated belief about the parameter after you've seen the data.

$p(\theta|V_T) = \dfrac{f(V_T|\theta)\,p(\theta)}{p(V_T)}$

$\text{Updated Belief} = \dfrac{(\text{Likelihood of seeing the evidence given your theory}) \times (\text{Your initial belief})}{(\text{Overall probability of seeing that evidence})}$
Once you have this updated belief (the posterior), you can make smart conclusions, like picking the most likely value for your
parameter or creating a "credible set"—a range where you're 95% sure the true parameter value lies.
Quiz Question: What is a "prior" in the Bayesian world?
a) The evidence you collect.
b) Your final conclusion after seeing the data.
c) Your initial belief or hunch about something before you see any evidence.
Answer: c) Your initial belief or hunch about something before you see any evidence.
Explanation: The prior is your starting point, which you then update with data to get your posterior (final) belief.
Imagine an archer trying to hit a bullseye ( θ ) that is fixed but hidden behind a screen.
Question | Frequentist | Bayesian
What is θ (the parameter)? | Fixed, but unknown. | A random variable we have beliefs about.
What is probability? | The long-run frequency of an event over many repeated trials. | A degree of belief or confidence about something.
What is fixed? | The parameter θ is fixed. | The data $V_T$ is fixed (once you've seen it).
Confidence/Credible Sets | "95% confidence" is about the reliability of the method over many repeats. | "95% credible set" is a direct probability statement about where θ likely is.
Reasons to be Bayesian
Why might you choose the Bayesian way of thinking? Here are a few powerful reasons.
Reason 1 - Repeated Sampling Can Feel Unnatural
This is especially true in time series, like analyzing the stock market or inflation. We only have one history of the world! The frequentist idea of "repeated samples" (imagining thousands of other worlds to get different stock market histories) can feel a bit strange. The Bayesian approach of updating our beliefs as one single timeline unfolds often makes more sense.
Reason 2 - Convenient Shortcuts Exist (Conjugate Priors)
Sometimes, the math can be really clean. A conjugate prior is when your prior belief and the data's likelihood are from the
same "family" of distributions. When you combine them, the posterior belief is also in that same family!
Analogy: It's like mixing paints within a single color family. If you start with a shade of blue (a Normal distribution prior) and mix in data that also behaves like a Normal distribution, you're guaranteed to get another shade of blue (a Normal distribution posterior). This makes calculations much easier.
Reason 3 - With Enough Data, the Prior Washes Out
Analogy: Two detectives are on a case. Detective A has a strong hunch (prior) that the butler did it. Detective B has no idea. If they collect a mountain of DNA evidence that points to the gardener, both detectives will end up agreeing that the gardener is the culprit. The initial hunch becomes irrelevant in the face of overwhelming evidence.
This means that for large datasets, Bayesian and frequentist results often look very similar.
Reason 4 - Nuisance Parameters Are Handled Naturally
Analogy: You want to find the average height of students ( θ₁ ). Your model also includes their average shoe size ( θ₂ ), which you don't care about. The Bayesian framework has a very natural way to "integrate out" or average over all the possibilities for shoe size, giving you a clean answer just for height. This can be extremely difficult in the frequentist world.
Quiz Question: What happens to the Bayesian's prior belief when a very large amount of data is collected?
a) The prior becomes even more important.
b) The prior's influence washes out and becomes irrelevant.
c) The prior and the data become equally important.
Answer: b) The prior's influence washes out and becomes irrelevant.
Explanation: As the detective analogy showed, overwhelming evidence swamps any initial hunch.
To turn the posterior into a single best guess, you have three standard choices:
Posterior Mean: The average value of your posterior distribution. This is the best choice if you want to minimize the
squared error of your guess.
Posterior Mode: The peak of your posterior distribution—the single most likely value.
Posterior Median: The middle value (50th percentile). This is a good choice if you want to avoid big errors in one direction
more than the other.
Testing Hypotheses
In the Bayesian world, you test ideas by comparing their posterior probabilities. For a hypothesis $H_0$ ("the parameter is 0") vs. $H_1$ ("the parameter is not 0"), you can calculate the posterior odds ratio:

$\text{Posterior Odds} = \dfrac{P(H_0|\text{Data})}{P(H_1|\text{Data})}$
If this ratio is large, you favor $H_0$. If it's small, you favor $H_1$. This approach treats both hypotheses more symmetrically than the frequentist approach does.

A tricky problem is testing a point null (e.g., $H_0: \theta = 0$). With a continuous prior, the probability of any exact point is zero! The clever solution is to use a prior that puts a specific chunk of belief, say 20%, exactly on θ = 0 and spreads the other 80% over the other possibilities. This allows for a direct and sensible test.
Main Lesson 3
Hey there! 🚀 Let's dive into an exciting topic: Bayesian Estimation for GARCH models. It might sound complicated, but think
of it as being a data detective 🕵️♀️, using clues to update your guesses and uncover the truth. We'll break it all down step-by-
step. Let's get started!
Imagine you're trying to guess the score of a video game match before it starts. The old way (called Frequentist statistics) is
to just look at past data. The new way (Bayesian statistics) lets you use your own intuition or prior knowledge (like "Team A
has a superstar player!") and then update your guess as the game unfolds. It's a powerful and flexible way to think about data,
and it's used everywhere from finance to AI.
PYTHON
pip install arviz pymc pytensor
Now, let's import the libraries and load our dataset, which contains Google stock prices. We'll be looking at the daily returns of
the stock.
PYTHON
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
from pymc import GARCH11
from pytensor import shared
import pytensor
pytensor.config.mode = 'NUMBA'
PYTHON
# Download the dataset
m5_data = pd.read_csv("M5. goog_eur_10.csv")
1. Bayesian Statistics
1.1 What is Bayesian Statistics?
So, what's the big deal with Bayesian statistics? It all comes down to one core idea.
Frequentist Statistics: This is what you've likely learned in school so far. It treats parameters like the mean (average) or
variance (spread) of data as fixed, unknown numbers. Our job is to estimate that one true number.
Bayesian Statistics: This approach is different. It treats parameters as random variables. This means instead of one
single "true" answer, a parameter can have a whole range of possible values, each with a certain probability.
A frequentist would take a sample of gumballs, count them, and use that to make a single, best guess for the total
number in the jar.
A Bayesian, on the other hand, starts with an initial guess (maybe based on the size of the jar). This initial guess is called
a prior. Then, they take a sample of gumballs and use that new information to update their initial guess. The updated
guess, which is a range of likely values, is called a posterior.
This method is sometimes seen as more flexible, but it can also be criticized because the researcher's initial guess (the prior)
can influence the result. It's a big debate in the world of statistics!
Quiz Question: What is the main difference between how frequentist and Bayesian statistics view parameters (like the mean)?
Answer: Frequentist statistics sees parameters as single, fixed values that we try to estimate. Bayesian statistics sees them
as random variables that have a probability distribution.
Explanation: Think of it like finding a lost key. A frequentist believes the key is in one specific spot, and they're trying to find
that spot. A Bayesian thinks the key could be in several places and assigns a probability to each spot, updating those
probabilities as they search.
Prior Probability Distribution (Prior): This is your belief about a parameter before you see any data. It's your starting
point or your initial guess. For example, "I think there's a 60% chance my favorite team will win tonight."
Likelihood Function (Likelihood): This is the new information you get from your data. It tells you how likely it is to see this
data, given a certain value of the parameter. For example, "The data shows my team has won 8 of their last 10 games."
Posterior Probability Distribution (Posterior): This is your updated belief after you combine your prior with the likelihood
from the data. It's the grand finale!
Diagram: 🤔 prior belief (your initial guess) combines with 🔍 the likelihood from your data to give 🎯 the posterior belief (your new, updated guess).
Quiz Question: If your "prior" is your initial guess, what is your "posterior"?
Answer: Your posterior is your updated guess after you've looked at the evidence or data.
Explanation: It's like starting with a hunch (prior), then gathering clues (likelihood), and finally forming a more educated
conclusion (posterior).
The two most popular MCMC algorithms are the Metropolis-Hastings algorithm and the Gibbs Sampling algorithm. The
software we'll use employs Metropolis-Hastings.
A key thing to remember is that MCMC can be sensitive to where you start. If you start at the shallow end of the pool, your
first few steps might be a bit wild. That's why we sometimes use a good starting guess, like the one we got from the MLE
method in the last lesson.
Quiz Question: Why do we use methods like MCMC to sample from a posterior distribution?
Answer: Because the posterior distribution's formula is often too complex to work with directly. Sampling gives us an
approximate picture of it.
Explanation: It's like trying to understand a giant, complex painting. Instead of trying to describe the whole thing at once, you
take thousands of tiny photo samples from all over the canvas. By looking at all the samples, you can piece together what the
whole painting looks like.
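To make the idea concrete, here's a toy Metropolis sampler for a one-parameter posterior shaped like Beta(3, 1) (a stand-in target, since the real GARCH posterior is much messier):

PYTHON
import numpy as np

rng = np.random.default_rng(3)

def log_post(theta):
    # Un-normalized log posterior: theta^2 on (0, 1), i.e. Beta(3, 1) without its constant
    return 2 * np.log(theta) if 0 < theta < 1 else -np.inf

theta, samples = 0.5, []
for _ in range(20000):
    proposal = theta + 0.1 * rng.standard_normal()   # random step from the current spot
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                             # accept the move...
    samples.append(theta)                            # ...otherwise stay put

print(np.mean(samples[1000:]))  # ~ 0.75, the Beta(3, 1) mean, after discarding burn-in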
The model allows for fat-tailed, Student's-t distributed errors with ν degrees of freedom, written as a scale mixture:

$r_t = \epsilon_t \sqrt{\dfrac{\nu - 2}{\nu}\,\omega_t\,\sigma_t^2}$

The big posterior formula we are trying to understand looks like this:

$p(\alpha_0, \alpha_1, \beta, \nu, \omega_t\,|\,r_t) = \dfrac{l(r_t\,|\,\alpha_0, \alpha_1, \beta, \nu, \omega_t)\;p(\alpha_0, \alpha_1, \beta, \nu, \omega_t)}{p(r_t)}$
Now, let's use the Python package PyMC to estimate the GARCH(1,1) model for Google's stock returns.
PYTHON
# Starting parameters: a diffuse "blank canvas" guess (means ~ 0, huge variances)
alpha_mu = shared(np.array([0.000001, 0.000001], dtype=np.float64))               # prior means for alpha0, alpha1
alpha_sigma = shared(np.array([[1000.0, 0.0], [0.0, 1000.0]], dtype=np.float64))  # wide prior covariance
beta_mu = shared(np.array(0.000001, dtype=np.float64))                            # prior mean for beta
beta_sigma = shared(np.array(1000.0, dtype=np.float64))                           # wide prior std for beta
ivolatility = shared(np.array(0.000001, dtype=np.float64))                        # initial volatility guess
ivolatility_vol = shared(np.array(10.0, dtype=np.float64))                        # uncertainty on that guess
We'll generate two sample series, called chains. Why two? It's like sending two blindfolded friends into the pool. If they both
come back with similar maps of the deep end, we can be more confident they found the right spot. This check is called
convergence diagnostics.
PYTHON
# Plot first round MCMC model posteriors
with mcmc0:
trace_mcmc0 = pm.sample(3000, cores=2, step=pm.Slice(), tune=0, return_inferencedata=True, random_seed=12345)
az.plot_trace(trace_mcmc0, var_names=["alpha0", "alpha1", "beta"],
lines=[("alpha0", {}, [0.124993]),("alpha1", {}, [0.082160]),("beta", {}, [0.867127])],
compact=False, legend=True, figsize=(16, 7))
plt.tight_layout()
plt.show()
Trace Plots (Right side of the plot): These show the journey of each chain for each parameter ($\alpha_0$, $\alpha_1$, β). You'd see two lines (one for each chain) zig-zagging. In this first run, the lines would look pretty wild at the beginning before settling down. This suggests the sampler took a while to find the "deep end" of the pool.
Density Plots (Left side of the plot): This is the final map! It shows a smooth curve representing the posterior distribution (our updated belief) for each parameter.
| mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat
alpha0 | 0.145094 | 0.053294 | 0.059351 | 0.262754 | 0.007721 | 0.004627 | 47.599936 | 51.832998 | 1.068832
alpha1 | 0.089880 | 0.025170 | 0.048496 | 0.141094 | 0.003876 | 0.002370 | 47.796746 | 48.588985 | 1.066640
beta | 0.852806 | 0.042358 | 0.757265 | 0.916286 | 0.006640 | 0.004178 | 44.814853 | 48.402712 | 1.074747
The most important column here is the last one: r_hat . This is the Gelman-Rubin statistic. It checks if our chains have
"converged" or agree with each other. A value close to 1.0 is great. A value above 1.1 or 1.2 is a red flag 🚩 that something's
wrong. Our values are around 1.07, which isn't perfect.
PYTHON
# starting parameters = MLE(0.124993, 0.082160, 0.867127)
alpha_mu = shared(np.array([0.124993, 0.082160], dtype=np.float64))
beta_mu = shared(np.array(0.867127, dtype=np.float64))
ivolatility = shared(np.array(1.63865, dtype=np.float64))
# ... (rest of the model setup is similar)
After running the MCMC again with these new starting values, the trace plots would look much more stable from the very
beginning. Let's check the new r_hat values.
| mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat
alpha0 | 0.151440 | 0.091908 | 0.067844 | 0.240691 | 0.007156 | 0.043919 | 82.938764 | 221.227579 | 1.029334
alpha1 | 0.092149 | 0.024706 | 0.054129 | 0.132585 | 0.002455 | 0.002861 | 92.183544 | 318.266329 | 1.024602
beta | 0.848298 | 0.051242 | 0.782080 | 0.913977 | 0.004892 | 0.015678 | 75.874103 | 231.988360 | 1.030617
Much better! The r_hat values are closer to 1.0. This shows that a good starting point can really help.
Let's run the model one last time, telling it to discard the first 250 samples from each chain.
PYTHON
with mcmc:
# ...
trace_mmc_a = pm.sample(3000, cores=2, step=pm.Slice(), tune=250, # tune=250 means a 250-step burn-in
return_inferencedata=True, random_seed=12345)
# ... (plotting code)
The final trace plots should now look very stable and consistent, like a fuzzy caterpillar 🐛. The density plots will give us our
final answer for what the parameters likely are.
| mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat
alpha0 | 0.143620 | 0.043415 | 0.074571 | 0.231585 | 0.004067 | 0.002347 | 108.231247 | 235.146465 | 1.008775
alpha1 | 0.090170 | 0.019715 | 0.057861 | 0.128158 | 0.001783 | 0.001058 | 119.996201 | 301.232062 | 1.003836
beta | 0.852987 | 0.032861 | 0.786890 | 0.907703 | 0.003265 | 0.001904 | 97.762053 | 211.792705 | 1.007301
Success! 🎉 All the r_hat values are very close to 1.0. This gives us confidence that our MCMC simulation worked well and our
results are reliable.
Final Comparison
How do our final Bayesian estimates compare to the MLE method from the last lesson?
The results are quite similar! This is a good sign. Using the MLE results as starting values helped guide our Bayesian model to
a comparable and reliable solution.
4. Conclusion
Wow, you made it! In this lesson, we took a deep dive into the world of Bayesian statistics.
Bayesian statistics is an incredibly powerful tool that's becoming more and more important, especially in fields like machine
learning. You've taken a huge step in understanding how modern data detectives work! Keep exploring! 🌟
LESSON 4
Required Reading
Hello! 🕵️♂️ Get ready to become a data detective as we dive into State-Space Models and the super-cool Kalman Filter. Think
of it like trying to track a hidden submarine. You can't see the sub itself (that's the hidden state), but you get blips on your
sonar every few minutes (those are your measurements). Our goal is to use those blips to figure out exactly where the sub is
and where it's going next! 🚀
State-Space Models
In this lesson, we're looking at a powerful tool called state-space models. They're used a ton in economics, especially for
tracking things that we can't see directly.
yt = μt + ϵt
μ t = μ t−1 + η t
The big challenge is that we only see $y_t$. We have to play detective to figure out the secret trend, $\mu_t$!
yt = β0 + β1 St + ϵt
This just means that when we're in a boom ($S_t = 1$), the growth is higher. We also have probabilities that describe how we switch between states (e.g., the chance of going from a bust to a boom). Again, we only see the final GDP growth, $y_t$, not the secret state $S_t$.
1. Estimating Parameters: Figuring out the values of things like the amount of noise ($\sigma_\epsilon^2$ and $\sigma_\eta^2$).
2. Extracting the State: Finding the value of the hidden state (like estimating the true trend $\mu_t$).
1. State Equation: Describes how the hidden state, $\alpha_t$, changes over time. It's the rulebook for the secret submarine's movement.
2. Measurement Equation: Describes how the clues we see, $y_t$, are related to the hidden state. It explains how the sonar blips relate to the submarine's true position.
Using these two equations, we can figure out the probability (or "likelihood") of seeing the data we've collected. This process
of using the data to update our beliefs about the hidden state is called filtering.
1. Combine Clues: Figure out the probability of your next measurement based on all past clues.
2. Predict: Guess where the hidden state will be before you get your next clue.
3. Update: Use your new clue to update your guess about the hidden state.
1. Predict: You know your friend was in the food court. You guess they're now heading towards the movie theater. ($f(\alpha_t|Y_{t-1})$)
2. Get a Clue: Your friend texts you, "Just passed the shoe store!" ($y_t$)
3. Update: You combine your prediction with this new clue. Now you have a much better idea of exactly where they are. ($f(\alpha_t|Y_t)$)
This process can be mathematically tricky because of some complex math called integration. But, there are two cases where
it's easy:
When we have normal distributions, we can use a super-powerful tool called the Kalman Filter!
Quiz Question: What are the two main equations in a state-space model, and what do they represent?
Answer: The State Equation and the Measurement Equation.
Explanation: The State Equation is the rulebook for how the hidden thing you're tracking moves (e.g., a submarine's path).
The Measurement Equation explains how the clues you see are related to that hidden thing (e.g., how the submarine creates
sonar blips).
Kalman Filtering
The Kalman Filter is the perfect tool for state-space models when all the random noise is "normal" (follows a bell curve).
α t = T α t−1 + Rη t
y t = Zα t + Sξ t
T , R, Z, S : These are just matrices (collections of numbers) that define the rules of the game. For example, T describes
how the sub moves from one moment to the next.
Since everything is based on normal distributions, all our predictions and updates will also be normal distributions. A normal
distribution is defined by just two things: its mean (center) and its variance (spread). So, the Kalman filter is just a set of
recipes for calculating these means and variances!
Diagram: the filter loop — updated belief at t−1 ($\alpha_{t-1|t-1}$, $P_{t-1|t-1}$) goes through the Predict step, then the Update step, producing the updated belief at t ($\alpha_{t|t}$, $P_{t|t}$).
This loop runs forward in time, getting better and better at tracking the hidden state with each new piece of information.
Quiz Question: The Kalman filter is a recipe for calculating what two numbers at each step?
Answer: The mean and the variance.
Explanation: Because we assume everything follows a normal (bell curve) distribution, we only need to keep track of the
center of the curve (the mean) and how wide it is (the variance). The Kalman filter just tells us how to update these two
numbers as new clues come in.
Kalman Smoother
The Kalman filter is great for real-time tracking, as it uses all information up to the present moment. But what if you have the
complete set of data for an entire year and want the absolute best estimate for the hidden state back in June?
For this, we use the Kalman Smoother. After the Kalman filter runs all the way to the end (forward in time), the smoother runs
backward from the end to the beginning. At each step, it revises the estimate using information from the future, making it
even more accurate.
Summary
For a state-space model, the Kalman filter is a superstar that can predict the next state, update that prediction with each new observation, and compute the likelihood of the data along the way.
For ARMA models, it's almost impossible to calculate the likelihood without the Kalman filter, making it a truly essential tool!
Quiz Question: You are using the Kalman filter to track a satellite, but a solar flare makes you lose the signal for an hour. What
does the filter do during that hour?
Answer: It keeps running its "Predict" step, using the satellite's last known trajectory to estimate its path. It temporarily skips
the "Update" step until a new signal is received.
Explanation: The filter doesn't just give up! It uses its internal model of how the satellite moves to coast through the missing
data period, making the best guess possible until it gets new information.
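Here's a minimal sketch of a one-dimensional (local level) Kalman filter that also "coasts" through missing observations exactly as the quiz describes (the noise scales are illustrative):

PYTHON
import numpy as np

def kalman_filter(y, sigma_eps=1.0, sigma_eta=0.1):
    # Local level model: y_t = mu_t + eps_t, mu_t = mu_{t-1} + eta_t.
    # np.nan entries in y are treated as missing observations.
    m, P = 0.0, 1e6                       # diffuse initial belief (mean, variance)
    means = []
    for obs in y:
        P = P + sigma_eta ** 2            # Predict: uncertainty grows, mean carries over
        if not np.isnan(obs):
            K = P / (P + sigma_eps ** 2)  # Kalman gain: weight on the new clue
            m = m + K * (obs - m)         # Update the mean toward the observation
            P = (1 - K) * P               # Update (shrink) the variance
        means.append(m)                   # if obs is missing, we just keep predicting
    return np.array(means)

y = np.array([1.0, 1.2, np.nan, np.nan, 1.5, 1.4])
print(kalman_filter(y))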
Main Lesson 4
A State Space Model does exactly this! It helps us understand a hidden state by looking at the observations it produces. It
uses two main equations to do this.
Observation Equation: This equation connects the hidden state to what we can actually see.
yt = At xt + vt , v t ∼ W N (0, V t )
State Equation: This equation describes how the hidden state changes over time.
x t = Φ t x t−1 + w t , w t ∼ W N (0, W t )
Diagram: inside the system, the hidden state evolves via the state equation (state at t=0 → state at t=1 → ...), and each state produces an observation through the observation equation.
Look at the coefficient matrices ($A_t$, $\Phi_t$) and noise variances ($V_t$, $W_t$):
If they can change over time (they have the little t), it's a time-varying model.
If they are fixed and don't change, it's a time-invariant model.
The Three Goals of SSM 🎯
Once we build a model, we want to use it for:
a. Predicting: Guessing the future state (What will my friend's mood be tomorrow?).
b. Filtering: Estimating the current state using all info up to today (What is my friend's mood right now based on the text
they just sent?).
c. Smoothing: Revising our estimate of a past state using info we got later (Looking back, what was their mood yesterday,
knowing what they texted today?).
Quiz Question: In a State Space Model, which variable can you directly see and measure: the "state" or the "observation"?
Answer: The observation.
Explanation: The observation ($y_t$) is the data we can collect, like a stock price or a text message. The state ($x_t$) is the hidden underlying factor, like market volatility or a person's mood, that we are trying to estimate.
a. $x_t$ is a Markov Process
This is a fancy way of saying the state is memoryless. The future state only depends on the current state, not the entire
history of how it got there.
Analogy: In a video game, your character's next action depends on their current health and position, not every single
move you made since the start of the game.
In math terms: P (x t |x t−1 , x t−2 , ⋯ , x 0 ) = P (x t |x t−1 )
b. Observations Depend Only on the Current State
This means that if you magically knew the exact hidden state at time t, you'd know everything you need to know about the observation at time t (except for some random noise).
Analogy: If you knew your friend was ecstatically happy (the state), you could predict they would send a text full of happy emojis (the observation).
Together, these two properties let the joint probability factorize:

$P(x_0, \ldots, x_T, y_1, \ldots, y_T) = P(x_0)\,\prod_{i=1}^{T} P(x_i|x_{i-1})\,P(y_i|x_i)$

This just means we can figure out the probability of everything happening by combining the probability of the starting state, how states change, and how observations are made.
Less Data Needed: Because of the Markov "memoryless" property, we don't need huge amounts of historical data.
Flexible Rules: They allow coefficients to change over time, which is great for modeling real-world situations where things
aren't always constant (like before and after a big financial crisis).
Handles Missing Values: They can cleverly fill in the gaps if some of your data is missing.
Models Complex Systems: Great for modeling systems with many moving parts (multivariate systems).
Works with Non-Stationary Data: Can be used on time series that don't have a constant mean or variance.
2. Kalman Filter 🤖
The Kalman Filter is the most famous type of State Space Model. It's a linear model that assumes the random noise is normal
(Gaussian). It's incredibly popular because it's efficient and doesn't require tons of computing power.
Observation Equation: $y_t = F' x_t + v_t, \quad v_t \sim N(0, V)$

State Equation: $x_t = G x_{t-1} + w_t, \quad w_t \sim N(0, W)$

At each step, the filter applies Bayes' theorem:

$P(x_t|y_t, D_{t-1}) = \dfrac{P(y_t|x_t)\,P(x_t|D_{t-1})}{P(y_t)}$
Posterior ($P(x_t|y_t, D_{t-1})$): Our updated belief about the state, after seeing today's observation.
Prior ($P(x_t|D_{t-1})$): Our prior belief about the state, before seeing today's observation.
This process gives us a set of results called the Kalman filter recursion. The most important result is how to find the mean ($m_t$) and variance ($C_t$) of the state today:

$m_t = a_t + K_t e_t$

where $a_t$ is the predicted mean, $e_t$ is the one-step forecast error, and $K_t$ is the Kalman gain.
This is a recursive method—it only needs the result from the last period (t − 1) and the new data from today (t) to work. No
need to re-process all the old data!
Quiz Question: What does the Kalman Gain ($K_t$) do in the Kalman Filter?
Answer: It acts as a tuning factor to decide how much weight to put on the new observation versus the previous forecast.
Explanation: A high Kalman Gain means the new observation is very influential, causing a big update to the state estimate. A
low gain means the new observation has less impact.
Forecast State ($x_{t+1}$): We use the transition matrix G on our current best estimate $m_t$.
For more than one step ahead (d > 1), we just repeat the process recursively!
What if the observation noise itself has changing volatility? We can replace the constant-variance noise with a GARCH process:

$y_t = F' x_t + z_t$
$z_t = s_t \epsilon_t, \text{ where } \epsilon_t \sim N(0, 1)$
$s_t^2 = \alpha_0 + \alpha_1 z_{t-1}^2 + \beta_1 s_{t-1}^2$

Look familiar? That last equation is exactly GARCH!

The main difference this makes is in the one-step forecast for the observation's variance:

$Q_t = F' R_t F + s_t^2$

Instead of a constant noise variance V, we now have $s_t^2$, the time-varying variance from the GARCH model.
To estimate all the unknown parameters in this combined model, a special technique called Forward Filtering and Backward
Sampling (FFBS) is often used.
The lesson's interactive tool lets you:
1. Choose an asset: Pick from Google stock returns, EUR/USD exchange rates, or Treasury bond yields.
2. Choose an error distribution: Select normal, Student's t, etc., for your GARCH model.
3. View Results:
See plots of the asset's returns (time series, histogram, QQ plot).
See plots of the squared returns to check for volatility clustering (ACF and PACF plots).
Get the final GARCH(1,1) model results and diagnostic checks.
5. Conclusion ✅
In this lesson, we journeyed through the powerful world of State Space Models. We learned the basics, saw their properties
and benefits, and dove into the famous Kalman Filter. We then successfully combined the Kalman Filter with a GARCH(1,1)
model to handle changing volatility in a sophisticated way.