EE675A - IITK, 2022-23-II
Lecture 5: Thompson Sampling for General Distributions
23rd January 2023
Lecturer: Subrahmanya Swamy Peruru Scribe: Saurabh | Onkar Dasari
1 Introduction
Thompson Sampling is a widely used algorithm in the field of Machine Learning and Reinforce-
ment Learning. It is a probabilistic approach that provides a way to balance exploration and ex-
ploitation by selecting actions based on the estimated distribution of rewards. The algorithm can be
applied to various distributions, making it a general method for sequential decision-making prob-
lems. In addition, it has been shown to have regret bounds, which quantify the algorithm’s perfor-
mance compared to an optimal strategy. This lecture aims to explore the concepts of Thompson
Sampling for general distributions and the associated regret bounds.
1.1 Recap from last lecture
The last lecture focused on Upper Confidence Bound (UCB) and Thompson Sampling for prior
Beta distributions. The UCB algorithm was presented as a way to balance exploration and exploitation
by adding a confidence term to the empirical mean reward of each action.
Basic idea of UCB:
1. Play each arm a ∈ A once.
2. For each round t, play arm a(t) ∈ A such that $a(t) = \arg\max_a UCB_t(a)$,
where $UCB_t(a) = \bar{\mu}_t(a) + \epsilon_t(a)$ and $\epsilon_t(a) = \sqrt{\frac{2 \log T}{n_t(a)}}$.
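To make the index computation concrete, here is a minimal Python sketch of the UCB rule above (illustrative only; the array names `means` and `counts` and the helper `ucb_indices` are not from the notes):

```python
import numpy as np

def ucb_indices(means, counts, horizon):
    """UCB index per arm: empirical mean plus the bonus sqrt(2 log T / n_t(a))."""
    # counts is assumed to be >= 1 for every arm, since each arm is played once at the start
    bonus = np.sqrt(2.0 * np.log(horizon) / counts)
    return means + bonus

# Example: 3 arms, each played once, with observed average rewards 0.2, 0.5, 0.4
means = np.array([0.2, 0.5, 0.4])
counts = np.array([1, 1, 1])
next_arm = int(np.argmax(ucb_indices(means, counts, horizon=1000)))
```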
Thompson Sampling was then introduced as a Bayesian alternative to UCB, where the algo-
rithm samples from the posterior distribution of the rewards for each action. Both methods were
shown to be effective in balancing exploration and exploitation, but Thompson Sampling was
shown to have better performance in some cases, especially when dealing with prior Beta distribu-
tions.
Basic idea of Thompson sampling:
1. At t = 0, we have prior Beta distributions for µ(a) for each arm a ∈ A as: B(α0 (a), β0 (a))
2. For t = 1 to T, do the following:
(a) For each arm a ∈ A, sample $\tilde{\theta}_t(a) \sim B(\alpha_{t-1}(a), \beta_{t-1}(a))$
(b) Play arm $a(t) = \arg\max_a \tilde{\theta}_t(a)$
(c) Update the posterior of the arm a(t) based on the reward rt ∈ {0, 1} received in round
t: αt (a(t)) = αt−1 (a(t)) + rt , βt (a(t)) = βt−1 (a(t)) + (1 − rt ).
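As a quick worked example of the update in step (c) (not from the original notes): starting from the uniform prior B(1, 1) and observing rewards 1, 1, 0 on some arm over three plays, the posterior becomes B(1 + 2, 1 + 1) = B(3, 2), whose mean 3/5 reflects the two successes out of three observations.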
The lecture concluded with a demonstration of how these algorithms can be used to solve the
multi-armed bandit problem and the differences in their exploration-exploitation trade-off.
2 Thompson Sampling
2.1 For Bernoulli Rewards and Beta Prior
In the case of Bernoulli rewards, the probability of success for each arm is modeled as a Bernoulli
random variable, taking values of 0 or 1 with a certain probability of success, denoted by θ. The
prior belief over θ for each arm is modeled as a Beta distribution with parameters α and β, repre-
senting the number of successes and failures, respectively, before observing any data [3].
A Beta prior is commonly used when the rewards are binary and follow a Bernoulli distribution.
Prior : {Beta(α0 (a), β0 (a))}a∈A
Algorithm 1:
for each $t \ge 1$ do
    for each arm $a \in A$ do
        Sample $\tilde{\theta}_t(a)$ from $\mathrm{Beta}(\alpha_{t-1}(a), \beta_{t-1}(a))$
    Play $a(t) = \arg\max_a \tilde{\theta}_t(a)$
    Update the posterior of arm $a(t)$ to $\mathrm{Beta}(\alpha_{t-1}(a(t)) + r,\; \beta_{t-1}(a(t)) + 1 - r)$
where r ∈ {0, 1} is the reward observed in round t, and the posterior is the updated Beta distribution that represents our current beliefs about the arm's success probability after taking the observed data into account.
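The following is a minimal Python sketch of this Beta-Bernoulli Thompson Sampling loop (a sketch for illustration; the simulated environment `true_means` and the function name `thompson_sampling_bernoulli` are assumptions, not from the notes):

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, rng=None):
    """Beta-Bernoulli Thompson Sampling with a uniform Beta(1, 1) prior on each arm."""
    rng = rng or np.random.default_rng(0)
    k = len(true_means)
    alpha = np.ones(k)  # alpha_0(a) = 1 for every arm (uniform prior)
    beta = np.ones(k)   # beta_0(a) = 1 for every arm
    rewards = []
    for t in range(horizon):
        theta = rng.beta(alpha, beta)          # sample theta_t(a) from each arm's posterior
        arm = int(np.argmax(theta))            # play the arm with the largest sample
        r = rng.binomial(1, true_means[arm])   # observe a Bernoulli reward
        alpha[arm] += r                        # posterior update for the played arm only
        beta[arm] += 1 - r
        rewards.append(r)
    return np.array(rewards), alpha, beta

# Example: three arms with unknown success probabilities 0.3, 0.5, 0.7
rewards, alpha, beta = thompson_sampling_bernoulli([0.3, 0.5, 0.7], horizon=2000)
```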
2.2 For Gaussian Reward
Reward distribution: $\mathcal{N}(\mu(a), 1)$ for arm $a$
Prior: $P_0(\mu(a) = \theta) = \mathcal{N}(0, 1)$
When the reward follows a Gaussian distribution and the prior is also a Gaussian distribution, the
posterior distribution is also a Gaussian distribution, with an updated mean and variance that reflect
the new information from the observed data. This follows from Bayes' theorem, which states that
the posterior is proportional to the product of the prior and the likelihood.
$$P_t(\mu(a) = \theta) = \mathcal{N}\!\left(\bar{\mu}_t(a),\; \frac{1}{n_t(a) + 1}\right)$$
By having a Gaussian posterior, the Thompson Sampling algorithm can sample from the posterior
distribution efficiently and make decisions in real-time without having to perform computationally
intensive numerical methods.
Thompson Sampling with Gaussian rewards and a N(0, 1) prior is a Bayesian approach to the multi-armed bandit problem. In this case, the reward of each arm is assumed to be Gaussian distributed, and the prior belief over the expected reward of each arm is modeled as a normal distribution with mean 0 and variance 1.
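To see where $\bar{\mu}_t(a)$ and the $\frac{1}{n_t(a)+1}$ variance come from (standard conjugate-Gaussian algebra, not spelled out in the notes): with a $\mathcal{N}(0,1)$ prior on µ(a) and unit-variance Gaussian rewards $r_1, \dots, r_n$ observed from arm a, the posterior is $\mathcal{N}\!\left(\frac{\sum_{i=1}^{n} r_i}{n+1}, \frac{1}{n+1}\right)$; that is, $\bar{\mu}_t(a)$ is the sum of the observed rewards divided by $n_t(a) + 1$, and each additional observation shrinks the posterior variance.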
Algorithm 2:
Set $\bar{\mu}_0(a) = 0$ for all $a \in A$
for each $t \ge 1$ do
    for each arm $a \in A$ do
        Sample $\tilde{\theta}_t(a)$ from $\mathcal{N}\!\left(\bar{\mu}_{t-1}(a), \frac{1}{n_{t-1}(a)+1}\right)$
    Play $a(t) = \arg\max_a \tilde{\theta}_t(a)$
    Update $\bar{\mu}_t(a(t))$ based on the observed reward (the posteriors of the other arms are unchanged)
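A minimal Python sketch of Algorithm 2 under the same assumptions (unit-variance rewards, N(0, 1) priors); the simulated environment `true_means` and the function name are illustrative, not from the notes:

```python
import numpy as np

def thompson_sampling_gaussian(true_means, horizon, rng=None):
    """Thompson Sampling with N(0, 1) priors and unit-variance Gaussian rewards."""
    rng = rng or np.random.default_rng(0)
    k = len(true_means)
    counts = np.zeros(k)       # n_t(a): number of pulls of each arm
    reward_sums = np.zeros(k)  # running sum of rewards of each arm
    chosen = []
    for t in range(horizon):
        post_mean = reward_sums / (counts + 1)   # posterior mean of mu(a)
        post_var = 1.0 / (counts + 1)            # posterior variance 1 / (n_t(a) + 1)
        theta = rng.normal(post_mean, np.sqrt(post_var))
        arm = int(np.argmax(theta))
        r = rng.normal(true_means[arm], 1.0)     # unit-variance Gaussian reward
        counts[arm] += 1
        reward_sums[arm] += r
        chosen.append(arm)
    return np.array(chosen)

# Example: arm 2 has the highest true mean and should dominate eventually
arms_played = thompson_sampling_gaussian([0.0, 0.5, 1.0], horizon=2000)
```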
2.3 For General Distributions
Applying the Thompson Sampling algorithm to Gaussian rewards only requires, for each arm a, the number of times it has been played, its empirical mean, and the Gaussian distribution assumption. We can, however, run the same algorithm even when the rewards follow other, general distributions; some additional care is then needed in the analysis. Before stating the corresponding regret bounds, we first define a few terms.
2.3.1 Environment
A bandit environment for a general case is defined by the distributions of all the arms, denoted as
{Da }a∈A where Da represents the underlying distribution of arm a. This means that the environ-
ment is fully specified by the information about the reward distributions for each arm.
For the Bernoulli case, µ(a) alone specifies the distribution of each arm, so the environment is completely specified by {µ(a)}a∈A.
For the general Gaussian case, the environment is Env = {µ(a), σ(a)}a∈A, and if we assume unit variance it reduces to Env = {µ(a)}a∈A.
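As a small illustration (not part of the notes), a bandit environment can be represented in code simply as a collection of per-arm reward samplers; the names below are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# A Bernoulli environment is fully specified by the means {mu(a)}.
bernoulli_env = {a: (lambda mu=mu: rng.binomial(1, mu)) for a, mu in enumerate([0.3, 0.5, 0.7])}

# A unit-variance Gaussian environment is likewise specified by its means alone.
gaussian_env = {a: (lambda mu=mu: rng.normal(mu, 1.0)) for a, mu in enumerate([0.0, 0.5, 1.0])}

reward = bernoulli_env[2]()  # pull arm 2 once and observe a reward
```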
2.3.2 KL-Divergence (Kullback-Leibler Divergence)
Kullback-Leibler Divergence (KL-Divergence) is a measure of the difference between two prob-
ability distributions. It is a non-symmetric and non-negative measure that quantifies the amount of
information lost when approximating one distribution with another. KL-Divergence is commonly
used to measure the distance between the true distribution of rewards and the estimated distribution
in a bandit algorithm.
1. For a discrete case:
Let P and Q be two probability distributions defined on the same sample space Ω. Then
$$D_{KL}(P \,\|\, Q) = \sum_{x \in \Omega} P(x) \log \frac{P(x)}{Q(x)}$$
2. For a continuous case:
Let f(x) and g(x) be two probability density functions on the same sample space Ω. Then
$$D_{KL}(f \,\|\, g) = \int_{\Omega} f(x) \log \frac{f(x)}{g(x)} \, dx$$
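For instance, a short Python sketch computing the discrete KL-Divergence directly from the definition above (the example distributions p and q are made up):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x)), with 0 * log(0/q) taken as 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ~0.51
print(kl_divergence(q, p))  # ~0.37, a different value: the KL-Divergence is not symmetric
```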
2.3.3 KL-Divergence of Bernoulli Distribution
Let Ber(µa) and Ber(µb) be two Bernoulli distributions. We now estimate their KL-Divergence; denote KL(Ber(µa) ∥ Ber(µb)) by KLBer(µa, µb). Recall the following result [2]: for two unit-variance Gaussian distributions fa(x) and fb(x) with means µa and µb,
$$KL(f_a \,\|\, f_b) = \frac{1}{2}(\mu_a - \mu_b)^2.$$
For Bernoulli distributions, the KL-Divergence can similarly be bounded in terms of (µa − µb)²:
$$2(\mu_a - \mu_b)^2 \;\le\; KL_{Ber}(\mu_a, \mu_b) \;\le\; \frac{(\mu_a - \mu_b)^2}{\mu_b (1 - \mu_b)}$$
The lower bound is Pinsker's inequality specialized to Bernoulli distributions. Now set $\mu_b = \mu^* = \max_a \mu(a)$ and write $\Delta(a) = \mu^* - \mu(a)$. The lower bound gives
$$KL(\mu(a), \mu^*) \;\ge\; 2\Delta^2(a),$$
and, since µ∗(1 − µ∗) is a fixed constant for a given environment (with 0 < µ∗ < 1), the upper bound gives KL(µ(a), µ∗) ≤ ∆²(a)/(µ∗(1 − µ∗)). Hence, for a fixed environment, KL(µ(a), µ∗) = Θ(∆²(a)).
These bounds give us the estimate of the KL-Divergence between Bernoulli distributions that we will use in the regret bounds below.
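A short numerical sanity check of these bounds (illustrative, not from the notes; the helper name is made up):

```python
import numpy as np

def kl_bernoulli(mu_a, mu_b):
    """KL(Ber(mu_a) || Ber(mu_b)) for 0 < mu_a, mu_b < 1."""
    return mu_a * np.log(mu_a / mu_b) + (1 - mu_a) * np.log((1 - mu_a) / (1 - mu_b))

mu_a, mu_star = 0.5, 0.7
gap = mu_star - mu_a
kl = kl_bernoulli(mu_a, mu_star)
lower = 2 * gap**2                          # Pinsker-type lower bound
upper = gap**2 / (mu_star * (1 - mu_star))  # chi-squared-type upper bound
print(lower <= kl <= upper)  # True: 0.080 <= ~0.087 <= ~0.190
```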
3 Regret For Thompson Sampling
3.1 For Bernoulli Reward, Beta Prior
For any Bernoulli environment Env = {µ(a)}a∈A, the expected regret satisfies [1]:
$$E[R(T; \text{env})] \;\le\; O(\log T) \sum_{a \ne a^*} \frac{\Delta(a)}{KL(\mu(a), \mu^*)}$$
Since KL(µ(a), µ∗) ≥ 2∆²(a), each term satisfies ∆(a)/KL(µ(a), µ∗) ≤ 1/(2∆(a)), so the above bound gives
$$E[R(T; \text{env})] \;\le\; \frac{O(K \log T)}{\Delta},$$
where $\Delta = \min_{a \ne a^*} \Delta(a)$ and K is the number of arms.
Similarly, the instance-independent (worst-case) regret of Thompson Sampling for the Bernoulli case,
$$R(T) = \max_{\text{env} \in \{\text{Bernoulli}\}} R(T, \text{env}),$$
satisfies the following bound:
$$R(T) \;\le\; O\!\left(\sqrt{KT \log T}\right)$$
Similar regret bounds can be shown for Gaussian Thompson Sampling when rewards are Gaussian
distributed.
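To get a rough feel for how the two bounds compare (a back-of-the-envelope illustration, not from the notes, ignoring the constants hidden in the O(·)): with K = 10 arms, T = 10^6 rounds, and minimum gap ∆ = 0.1, the gap-dependent bound scales like K log T / ∆ = 10 · ln(10^6)/0.1 ≈ 1.4 × 10^3, while the worst-case bound scales like √(KT log T) = √(10 · 10^6 · ln(10^6)) ≈ 1.2 × 10^4. When the gap ∆ is very small, however, the √(KT log T) bound becomes the tighter of the two.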
3.2 Bayesian Regret
We have motivated Thompson Sampling by noting that it is extremely advantageous to have prior knowledge of the underlying true means. If we already know which true mean values are more likely to be encountered, it makes little sense to design algorithms that must perform well in every potential setting; it is more natural to evaluate an algorithm's performance with emphasis on the most probable settings, weighted according to the prior distribution. This is the viewpoint behind Bayesian regret.
References
[1] S. Agrawal. Multi-armed Bandits and Reinforcement Learning, Lecture 5.
[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2018.
[3] O. Chapelle and L. Li. An empirical evaluation of Thompson Sampling. Advances in Neural Information Processing Systems, 24, 2011.