REINFORCEMENT LEARNING
Upper Confidence Bound
Upper Confidence Bound (UCB) is one of the most widely used solution
methods for multi-armed bandit problems. The algorithm is based on
the principle of optimism in the face of uncertainty.
In other words, the more uncertain we are about an arm, the more
important it becomes to explore that arm.
Consider the distributions of the action-value estimates for three arms
a1, a2 and a3 after several trials (shown as a figure in the original).
The estimate for a1 has the highest variance and hence the greatest
uncertainty, so the optimism principle tells us to explore a1 first.
UCB is actually a family of algorithms. Here, we will discuss UCB1.
Steps involved in UCB1 (a runnable sketch follows the list):
Play each of the K actions once, giving an initial estimate of the mean
reward corresponding to each action
For each round t = K+1, K+2, ...:
Let Nt(a) represent the number of times action a has been played so far
Play the action a_t maximising the following expression:
a_t = \arg\max_a \left[ Q_t(a) + \sqrt{\frac{2 \ln t}{N_t(a)}} \right]
where Q_t(a) is the current estimate of the mean reward for action a
Observe the reward and update the mean reward or expected payoff
for the chosen action
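To make these steps concrete, here is a minimal Python sketch of UCB1 on a Bernoulli bandit. The pull helper, the arm success probabilities, and the horizon T are illustrative assumptions, not part of the original.

```python
import math
import random

def ucb1(pull, K, T):
    """Run UCB1 for T rounds on a K-armed bandit.

    pull(a) returns a sampled reward for arm a (assumed to lie in [0, 1]).
    """
    counts = [0] * K   # Nt(a): number of times each arm has been played
    means = [0.0] * K  # Qt(a): running mean reward for each arm

    # Step 1: play each of the K arms once to initialise the estimates.
    for a in range(K):
        counts[a] = 1
        means[a] = pull(a)

    # Steps 2-4: for rounds t = K+1, ..., T, play the arm maximising
    # mean reward plus the exploration bonus sqrt(2 ln t / Nt(a)).
    for t in range(K + 1, T + 1):
        a = max(range(K),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        # Incremental update of the running mean for the chosen arm.
        means[a] += (r - means[a]) / counts[a]
    return means, counts

# Example: 3 Bernoulli arms with (assumed) success probabilities.
random.seed(0)
probs = [0.3, 0.5, 0.7]
means, counts = ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0,
                     K=3, T=10_000)
print(counts)  # the best arm (index 2) should dominate the play counts
```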
Each time a is selected, the uncertainty is presumably reduced: Nt(a)
increments and, as it appears in the denominator, the uncertainty
term decreases.
On the other hand, each time an action other than a is selected, t
increases, but Nt(a) does not; because t appears in the numerator,
the uncertainty estimate increases.
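These two effects are easy to verify numerically; the specific values of t and Nt(a) below are illustrative assumptions:

```python
import math

def bonus(t, n):
    # Exploration bonus term from the UCB1 expression: sqrt(2 ln t / Nt(a)).
    return math.sqrt(2 * math.log(t) / n)

print(round(bonus(10, 1), 2))   # 2.15: a rarely played arm gets a large bonus
print(round(bonus(10, 5), 2))   # 0.96: playing the arm shrinks its bonus
print(round(bonus(100, 5), 2))  # 1.36: rounds passing without it grow the bonus back
```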
The use of the natural logarithm means that these increases get smaller
over time; all actions will eventually be selected, but actions with
lower value estimates, or that have already been selected frequently,
are chosen with decreasing frequency.
Ultimately, this leads to the optimal action being selected almost
exclusively.