
UCB Algorithm in RL

Upper Confidence Bound (UCB) is a reinforcement learning algorithm that balances exploration and exploitation by optimistically selecting actions with high uncertainty. UCB1, a specific algorithm in the UCB family, works by playing each action once initially, then selecting subsequent actions to maximize expected reward plus an uncertainty term that favors less frequently played actions. Over time, as actions are selected more often and uncertainty decreases, UCB1 will identify and repeatedly select the optimal action.


REINFORCEMENT LEARNING

Upper Confidence Bound

Upper Confidence Bound (UCB) is one of the most widely used solution methods for multi-armed bandit problems. The algorithm is based on the principle of optimism in the face of uncertainty.

In other words, the more uncertain we are about an arm, the more
important it becomes to explore that arm.

[Figure: distributions of the action-value estimates for three arms a1, a2, and a3 after several trials.]

The figure shows that the action-value estimate for a1 has the highest variance and hence the greatest uncertainty, so a1 is the arm an optimistic strategy would explore next.

UCB is actually a family of algorithms. Here, we will discuss UCB1.

Steps involved in UCB1:

 Play each of the K actions once, giving an initial estimate of the mean reward Q(a) for each action a.
 For each round t = K+1, K+2, …:
 Let N_t(a) represent the number of times action a has been played so far.
 Play the action a_t maximising the following expression:

a_t = \arg\max_a \left[ Q_t(a) + \sqrt{\frac{2 \ln t}{N_t(a)}} \right]

 Observe the reward and update the mean reward (expected payoff) Q_t(a_t) for the chosen action.
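
A minimal runnable sketch of these steps in Python follows. The ucb1 function, the callable-per-arm interface, and the Bernoulli example arms are illustrative assumptions for this sketch, not part of the original text.

    import math
    import random

    def ucb1(arms, horizon):
        """Run UCB1 on a K-armed bandit.

        arms: list of K callables; arms[a]() samples a reward from arm a
        (this per-arm callable interface is an assumption of this sketch).
        Returns the empirical mean estimates and play counts per arm.
        """
        k = len(arms)
        counts = [0] * k    # N_t(a): number of times each arm was played
        means = [0.0] * k   # Q_t(a): empirical mean reward of each arm

        # Step 1: play each arm once to initialise the estimates.
        for a in range(k):
            means[a] = arms[a]()
            counts[a] = 1

        # Rounds t = K+1, ..., horizon.
        for t in range(k + 1, horizon + 1):
            # Pick the arm maximising Q_t(a) + sqrt(2 ln t / N_t(a)).
            a = max(range(k), key=lambda i: means[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
            reward = arms[a]()
            counts[a] += 1
            # Incremental update of the empirical mean.
            means[a] += (reward - means[a]) / counts[a]

        return means, counts

    # Usage: three Bernoulli arms with success probabilities 0.2, 0.5, 0.7.
    bandit = [lambda p=p: 1.0 if random.random() < p else 0.0
              for p in (0.2, 0.5, 0.7)]
    means, counts = ucb1(bandit, horizon=10_000)
    print(counts)   # the 0.7 arm should receive the vast majority of plays

With enough rounds, the play counts concentrate on the best arm while every arm keeps receiving a slowly growing minimum of exploration, which matches the behaviour analysed below.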

 Each time a is selected, the uncertainty about it is reduced: N_t(a) increments and, since it appears in the denominator, the uncertainty term decreases.

 On the other hand, each time an action other than a is selected, t increases but N_t(a) does not; because t appears in the numerator, the uncertainty estimate for a increases.

 The use of the natural logarithm means that the increases get smaller over time; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency (the short numeric sketch after this list illustrates these dynamics).
 This ultimately leads to the optimal action being selected repeatedly.
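
As a concrete numeric illustration of these dynamics, the snippet below simply evaluates the uncertainty term from the UCB1 rule at a few points; the bonus helper is an assumption of this sketch, not from the original text.

    import math

    def bonus(t, n):
        # Uncertainty term sqrt(2 ln t / N_t(a)) from the UCB1 rule.
        return math.sqrt(2.0 * math.log(t) / n)

    # Playing arm a shrinks its bonus: N_t(a) in the denominator grows.
    print(bonus(100, 10))   # ≈ 0.960
    print(bonus(101, 11))   # ≈ 0.916 (smaller: a play of a incremented N_t(a))

    # Skipping arm a grows its bonus, but only logarithmically in t.
    print(bonus(101, 10))   # ≈ 0.961 (slightly larger: only t grew)
    print(bonus(1000, 10))  # ≈ 1.175 (keeps growing while a is skipped)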
