EE675A - IITK, 2022-23-II
Lecture 5: Thompson Sampling for General Distributions
23rd January 2023
Lecturer: Subrahmanya Swamy Peruru Scribe: Saurabh | Onkar Dasari
1 Introduction
Thompson Sampling is a widely used algorithm in the field of Machine Learning and Reinforce-
ment Learning. It is a probabilistic approach that provides a way to balance exploration and ex-
ploitation by selecting actions based on the estimated distribution of rewards. The algorithm can be
applied to various distributions, making it a general method for sequential decision-making prob-
lems. In addition, it has been shown to have regret bounds, which quantify the algorithm’s perfor-
mance compared to an optimal strategy. This lecture aims to explore the concepts of Thompson
Sampling for general distributions and the associated regret bounds.
1.1 Recap from last lecture
The last lecture focused on Upper Confidence Bound (UCB) and Thompson Sampling for prior
Beta distributions. The UCB algorithm was presented as a way to balance exploration and exploitation
by adding a confidence term to the empirical mean reward of each action.
Basic idea of UCB:
1. Play each arm a ∈ A once.
2. For each round t, play arm a(t) ∈ A such that $a(t) = \arg\max_a UCB_t(a)$,
where $UCB_t(a) = \bar{\mu}_t(a) + \epsilon_t(a)$ and $\epsilon_t(a) = \sqrt{\frac{2 \log T}{n_t(a)}}$.
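To make the index computation concrete, here is a minimal Python sketch of the UCB rule above (illustrative only; the array names `means` and `counts` and the helper `ucb_indices` are not from the notes):

```python
import numpy as np

def ucb_indices(means, counts, horizon):
    """UCB index per arm: empirical mean plus the bonus sqrt(2 log T / n_t(a))."""
    # counts is assumed to be >= 1 for every arm, since each arm is played once at the start
    bonus = np.sqrt(2.0 * np.log(horizon) / counts)
    return means + bonus

# Example: 3 arms, each played once, with observed average rewards 0.2, 0.5, 0.4
means = np.array([0.2, 0.5, 0.4])
counts = np.array([1, 1, 1])
next_arm = int(np.argmax(ucb_indices(means, counts, horizon=1000)))
```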
Thompson Sampling was then introduced as a Bayesian alternative to UCB, where the algo-
rithm samples from the posterior distribution of the rewards for each action. Both methods were
shown to be effective in balancing exploration and exploitation, but Thompson Sampling was
shown to have better performance in some cases, especially when dealing with prior Beta distribu-
tions.
Basic idea of Thompson sampling:
1. At t = 0, we have prior Beta distributions for µ(a) for each arm a ∈ A as: B(α0 (a), β0 (a))
2. For t = 1 to T, do the following:
(a) For each arm a ∈ A, sample $\tilde{\theta}_t(a) \sim B(\alpha_{t-1}(a), \beta_{t-1}(a))$
(b) Play arm $a(t) = \arg\max_a \tilde{\theta}_t(a)$
(c) Update the posterior of the arm a(t) based on the reward rt ∈ {0, 1} received in round
t: αt (a(t)) = αt−1 (a(t)) + rt , βt (a(t)) = βt−1 (a(t)) + (1 − rt ).
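As a quick worked example of the update in step (c) (not from the original notes): starting from the uniform prior B(1, 1) and observing rewards 1, 1, 0 on some arm over three plays, the posterior becomes B(1 + 2, 1 + 1) = B(3, 2), whose mean 3/5 reflects the two successes out of three observations.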
The lecture concluded with a demonstration of how these algorithms can be used to solve the
multi-armed bandit problem and the differences in their exploration-exploitation trade-off.
2 Thompson Sampling
2.1 For Bernoulli Rewards and Beta Prior
In the case of Bernoulli rewards, the probability of success for each arm is modeled as a Bernoulli
random variable, taking values of 0 or 1 with a certain probability of success, denoted by θ. The
prior belief over θ for each arm is modeled as a Beta distribution with parameters α and β, repre-
senting the number of successes and failures, respectively, before observing any data [3].
A Beta prior is commonly used when the rewards are binary and follow a Bernoulli distribution.
Prior : {Beta(α0 (a), β0 (a))}a∈A
Algorithm 1:
for each $t \ge 1$ do
    for each arm $a \in A$ do
        Sample $\tilde{\theta}_t(a)$ from $\mathrm{Beta}(\alpha_{t-1}(a), \beta_{t-1}(a))$
    Play $a(t) = \arg\max_a \tilde{\theta}_t(a)$
    Update the posterior of arm $a(t)$ to $\mathrm{Beta}(\alpha_{t-1}(a(t)) + r,\; \beta_{t-1}(a(t)) + 1 - r)$
where r ∈ {0, 1} is the reward observed in round t, and the posterior is the updated Beta distribution that represents our current beliefs about the arm's success probability after taking the observed data into account.
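The following is a minimal Python sketch of this Beta-Bernoulli Thompson Sampling loop (a sketch for illustration; the simulated environment `true_means` and the function name `thompson_sampling_bernoulli` are assumptions, not from the notes):

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, rng=None):
    """Beta-Bernoulli Thompson Sampling with a uniform Beta(1, 1) prior on each arm."""
    rng = rng or np.random.default_rng(0)
    k = len(true_means)
    alpha = np.ones(k)  # alpha_0(a) = 1 for every arm (uniform prior)
    beta = np.ones(k)   # beta_0(a) = 1 for every arm
    rewards = []
    for t in range(horizon):
        theta = rng.beta(alpha, beta)          # sample theta_t(a) from each arm's posterior
        arm = int(np.argmax(theta))            # play the arm with the largest sample
        r = rng.binomial(1, true_means[arm])   # observe a Bernoulli reward
        alpha[arm] += r                        # posterior update for the played arm only
        beta[arm] += 1 - r
        rewards.append(r)
    return np.array(rewards), alpha, beta

# Example: three arms with unknown success probabilities 0.3, 0.5, 0.7
rewards, alpha, beta = thompson_sampling_bernoulli([0.3, 0.5, 0.7], horizon=2000)
```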
2.2 For Gaussian Reward
Reward distribution: $\mathcal{N}(\mu(a), 1)$ for arm $a$
Prior: $P_0(\mu(a) = \theta) = \mathcal{N}(0, 1)$
When the reward follows a Gaussian distribution and the prior is also a Gaussian distribution, the
posterior distribution is also a Gaussian distribution, with an updated mean and variance that reflect
the new information from the observed data. This follows from Bayes' theorem, which states that
the posterior is proportional to the product of the prior and the likelihood.
$$P_t(\mu(a) = \theta) = \mathcal{N}\!\left(\bar{\mu}_t(a),\; \frac{1}{n_t(a) + 1}\right)$$
By having a Gaussian posterior, the Thompson Sampling algorithm can sample from the posterior
distribution efficiently and make decisions in real-time without having to perform computationally
intensive numerical methods.
Thompson Sampling with Gaussian rewards and a N(0, 1) prior is a Bayesian approach to the multi-armed bandit problem. In this case, the reward of each arm is assumed to be Gaussian distributed, and the prior belief over the expected reward of each arm is modeled as a normal distribution with mean 0 and variance 1.
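To see where $\bar{\mu}_t(a)$ and the $\frac{1}{n_t(a)+1}$ variance come from (standard conjugate-Gaussian algebra, not spelled out in the notes): with a $\mathcal{N}(0,1)$ prior on µ(a) and unit-variance Gaussian rewards $r_1, \dots, r_n$ observed from arm a, the posterior is $\mathcal{N}\!\left(\frac{\sum_{i=1}^{n} r_i}{n+1}, \frac{1}{n+1}\right)$; that is, $\bar{\mu}_t(a)$ is the sum of the observed rewards divided by $n_t(a) + 1$, and each additional observation shrinks the posterior variance.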
Algorithm 2:
Set $\bar{\mu}_0(a) = 0$ for all $a \in A$
for each $t \ge 1$ do
    for each arm $a \in A$ do
        Sample $\tilde{\theta}_t(a)$ from $\mathcal{N}\!\left(\bar{\mu}_{t-1}(a), \frac{1}{n_{t-1}(a)+1}\right)$
    Play $a(t) = \arg\max_a \tilde{\theta}_t(a)$
    Update $\bar{\mu}_t(a(t))$ based on the observed reward (the posteriors of the other arms are unchanged)
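A minimal Python sketch of Algorithm 2 under the same assumptions (unit-variance rewards, N(0, 1) priors); the simulated environment `true_means` and the function name are illustrative, not from the notes:

```python
import numpy as np

def thompson_sampling_gaussian(true_means, horizon, rng=None):
    """Thompson Sampling with N(0, 1) priors and unit-variance Gaussian rewards."""
    rng = rng or np.random.default_rng(0)
    k = len(true_means)
    counts = np.zeros(k)       # n_t(a): number of pulls of each arm
    reward_sums = np.zeros(k)  # running sum of rewards of each arm
    chosen = []
    for t in range(horizon):
        post_mean = reward_sums / (counts + 1)   # posterior mean of mu(a)
        post_var = 1.0 / (counts + 1)            # posterior variance 1 / (n_t(a) + 1)
        theta = rng.normal(post_mean, np.sqrt(post_var))
        arm = int(np.argmax(theta))
        r = rng.normal(true_means[arm], 1.0)     # unit-variance Gaussian reward
        counts[arm] += 1
        reward_sums[arm] += r
        chosen.append(arm)
    return np.array(chosen)

# Example: arm 2 has the highest true mean and should dominate eventually
arms_played = thompson_sampling_gaussian([0.0, 0.5, 1.0], horizon=2000)
```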
2.3 For General Distributions
Applying the Thompson Sampling algorithm to Gaussian rewards only requires, for each arm a, the number of times it has been played, its empirical mean, and the Gaussian distribution assumption. We can, however, run the same algorithm even when the rewards follow other, general distributions; some additional care is then needed in the analysis. Before stating the corresponding regret bounds, we first define a few terms.
2.3.1 Environment
A bandit environment for a general case is defined by the distributions of all the arms, denoted as
{Da }a∈A where Da represents the underlying distribution of arm a. This means that the environ-
ment is fully specified by the information about the reward distributions for each arm.
For the Bernoulli case, µ(a) alone specifies the distribution of each arm, so the environment is completely specified by {µ(a)}a∈A.
For the general Gaussian case, the environment is Env = {µ(a), σ(a)}a∈A, and if we assume unit variance it reduces to Env = {µ(a)}a∈A.
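As a small illustration (not part of the notes), a bandit environment can be represented in code simply as a collection of per-arm reward samplers; the names below are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# A Bernoulli environment is fully specified by the means {mu(a)}.
bernoulli_env = {a: (lambda mu=mu: rng.binomial(1, mu)) for a, mu in enumerate([0.3, 0.5, 0.7])}

# A unit-variance Gaussian environment is likewise specified by its means alone.
gaussian_env = {a: (lambda mu=mu: rng.normal(mu, 1.0)) for a, mu in enumerate([0.0, 0.5, 1.0])}

reward = bernoulli_env[2]()  # pull arm 2 once and observe a reward
```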
2.3.2 KL-Divergence (Kullback-Leibler Divergence)
Kullback-Leibler Divergence (KL-Divergence) is a measure of the difference between two prob-
ability distributions. It is a non-symmetric and non-negative measure that quantifies the amount of
information lost when approximating one distribution with another. KL-Divergence is commonly
used to measure the distance between the true distribution of rewards and the estimated distribution
in a bandit algorithm.
1. For a discrete case:
Let P and Q be two probability distributions defined on the same sample space Ω. Then
$$D_{KL}(P \,\|\, Q) = \sum_{x \in \Omega} P(x) \log \frac{P(x)}{Q(x)}$$
2. For a continuous case:
Let f(x) and g(x) be two probability density functions on the same sample space Ω. Then
$$D_{KL}(f \,\|\, g) = \int_{\Omega} f(x) \log \frac{f(x)}{g(x)} \, dx$$
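For instance, a short Python sketch computing the discrete KL-Divergence directly from the definition above (the example distributions p and q are made up):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x)), with 0 * log(0/q) taken as 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ~0.51
print(kl_divergence(q, p))  # ~0.37, a different value: the KL-Divergence is not symmetric
```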
2.3.3 KL-Divergence of Bernoulli Distribution
Let Ber(µa) and Ber(µb) be two Bernoulli distributions. We now estimate their KL-Divergence; denote KL(Ber(µa) ∥ Ber(µb)) by KLBer(µa, µb). Recall the following result [2]: for two unit-variance Gaussian distributions fa(x) and fb(x) with means µa and µb,
$$KL(f_a \,\|\, f_b) = \frac{1}{2}(\mu_a - \mu_b)^2.$$
For Bernoulli distributions, the KL-Divergence can similarly be bounded in terms of (µa − µb)²:
$$2(\mu_a - \mu_b)^2 \;\le\; KL_{Ber}(\mu_a, \mu_b) \;\le\; \frac{(\mu_a - \mu_b)^2}{\mu_b (1 - \mu_b)}$$
The lower bound is Pinsker's inequality specialized to Bernoulli distributions. Now set $\mu_b = \mu^* = \max_a \mu(a)$ and write $\Delta(a) = \mu^* - \mu(a)$. The lower bound gives
$$KL(\mu(a), \mu^*) \;\ge\; 2\Delta^2(a),$$
and, since µ∗(1 − µ∗) is a fixed constant for a given environment (with 0 < µ∗ < 1), the upper bound gives KL(µ(a), µ∗) ≤ ∆²(a)/(µ∗(1 − µ∗)). Hence, for a fixed environment, KL(µ(a), µ∗) = Θ(∆²(a)).
These bounds give us the estimate of the KL-Divergence between Bernoulli distributions that we will use in the regret bounds below.
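A short numerical sanity check of these bounds (illustrative, not from the notes; the helper name is made up):

```python
import numpy as np

def kl_bernoulli(mu_a, mu_b):
    """KL(Ber(mu_a) || Ber(mu_b)) for 0 < mu_a, mu_b < 1."""
    return mu_a * np.log(mu_a / mu_b) + (1 - mu_a) * np.log((1 - mu_a) / (1 - mu_b))

mu_a, mu_star = 0.5, 0.7
gap = mu_star - mu_a
kl = kl_bernoulli(mu_a, mu_star)
lower = 2 * gap**2                          # Pinsker-type lower bound
upper = gap**2 / (mu_star * (1 - mu_star))  # chi-squared-type upper bound
print(lower <= kl <= upper)  # True: 0.080 <= ~0.087 <= ~0.190
```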
3 Regret For Thompson Sampling
3.1 For Bernoulli Reward, Beta Prior
For any Bernoulli environment Env = {µ(a)}a∈A, the expected regret satisfies [1]:
$$E[R(T; \text{env})] \;\le\; O(\log T) \sum_{a \ne a^*} \frac{\Delta(a)}{KL(\mu(a), \mu^*)}$$
Since KL(µ(a), µ∗) ≥ 2∆²(a), each term satisfies ∆(a)/KL(µ(a), µ∗) ≤ 1/(2∆(a)), so the above bound gives
$$E[R(T; \text{env})] \;\le\; \frac{O(K \log T)}{\Delta},$$
where $\Delta = \min_{a \ne a^*} \Delta(a)$ and K is the number of arms.
Similarly, the instance-independent (worst-case) regret of Thompson Sampling for the Bernoulli case,
$$R(T) = \max_{\text{env} \in \{\text{Bernoulli}\}} R(T, \text{env}),$$
satisfies the following bound:
$$R(T) \;\le\; O\!\left(\sqrt{KT \log T}\right)$$
Similar regret bounds can be shown for Gaussian Thompson Sampling when rewards are Gaussian
distributed.
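To get a rough feel for how the two bounds compare (a back-of-the-envelope illustration, not from the notes, ignoring the constants hidden in the O(·)): with K = 10 arms, T = 10^6 rounds, and minimum gap ∆ = 0.1, the gap-dependent bound scales like K log T / ∆ = 10 · ln(10^6)/0.1 ≈ 1.4 × 10^3, while the worst-case bound scales like √(KT log T) = √(10 · 10^6 · ln(10^6)) ≈ 1.2 × 10^4. When the gap ∆ is very small, however, the √(KT log T) bound becomes the tighter of the two.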
3.2 Bayesian Regret
We have motivated Thompson Sampling by noting that it is extremely advantageous to have prior knowledge of the underlying true means. If we already know which true mean values are more likely to be encountered, it makes little sense to design algorithms that must perform well in every potential setting; it is more natural to evaluate an algorithm's performance with emphasis on the most probable settings, weighted according to the prior distribution. This is the viewpoint behind Bayesian regret.
References
[1] S. Agrawal. Multi-armed Bandits and Reinforcement Learning, Lecture 5.
[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2018.
[3] O. Chapelle and L. Li. An empirical evaluation of Thompson Sampling. Advances in Neural Information Processing Systems, 24, 2011.