REINFORCEMENT LEARNING
Upper Confidence Bound
Upper Confidence Bound (UCB) is one of the most widely used solution
methods for multi-armed bandit problems. The algorithm is based on
the principle of optimism in the face of uncertainty.
In other words, the more uncertain we are about an arm, the more
important it becomes to explore that arm.
Consider the distributions of the action-value estimates for three arms
a1, a2 and a3 after several trials (shown as a figure in the original).
The estimate for a1 has the highest variance and hence the greatest
uncertainty, so the optimism principle tells us to explore a1 first.
UCB is actually a family of algorithms. Here, we will discuss UCB1.
Steps involved in UCB1 (a runnable sketch follows the list):
Play each of the K actions once, giving an initial estimate of the mean
reward corresponding to each action
For each round t = K+1, K+2, ...:
Let Nt(a) represent the number of times action a has been played so far
Play the action a_t maximising the following expression:
a_t = \arg\max_a \left[ Q_t(a) + \sqrt{\frac{2 \ln t}{N_t(a)}} \right]
where Q_t(a) is the current estimate of the mean reward for action a
Observe the reward and update the mean reward or expected payoff
for the chosen action
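To make these steps concrete, here is a minimal Python sketch of UCB1 on a Bernoulli bandit. The pull helper, the arm success probabilities, and the horizon T are illustrative assumptions, not part of the original.

```python
import math
import random

def ucb1(pull, K, T):
    """Run UCB1 for T rounds on a K-armed bandit.

    pull(a) returns a sampled reward for arm a (assumed to lie in [0, 1]).
    """
    counts = [0] * K   # Nt(a): number of times each arm has been played
    means = [0.0] * K  # Qt(a): running mean reward for each arm

    # Step 1: play each of the K arms once to initialise the estimates.
    for a in range(K):
        counts[a] = 1
        means[a] = pull(a)

    # Steps 2-4: for rounds t = K+1, ..., T, play the arm maximising
    # mean reward plus the exploration bonus sqrt(2 ln t / Nt(a)).
    for t in range(K + 1, T + 1):
        a = max(range(K),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        # Incremental update of the running mean for the chosen arm.
        means[a] += (r - means[a]) / counts[a]
    return means, counts

# Example: 3 Bernoulli arms with (assumed) success probabilities.
random.seed(0)
probs = [0.3, 0.5, 0.7]
means, counts = ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0,
                     K=3, T=10_000)
print(counts)  # the best arm (index 2) should dominate the play counts
```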
Each time a is selected, the uncertainty is presumably reduced: Nt(a)
increments and, as it appears in the denominator, the uncertainty
term decreases.
On the other hand, each time an action other than a is selected, t
increases, but Nt(a) does not; because t appears in the numerator,
the uncertainty estimate increases.
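These two effects are easy to verify numerically; the specific values of t and Nt(a) below are illustrative assumptions:

```python
import math

def bonus(t, n):
    # Exploration bonus term from the UCB1 expression: sqrt(2 ln t / Nt(a)).
    return math.sqrt(2 * math.log(t) / n)

print(round(bonus(10, 1), 2))   # 2.15: a rarely played arm gets a large bonus
print(round(bonus(10, 5), 2))   # 0.96: playing the arm shrinks its bonus
print(round(bonus(100, 5), 2))  # 1.36: rounds passing without it grow the bonus back
```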
The use of the natural logarithm means that these increases get smaller
over time; all actions will eventually be selected, but actions with
lower value estimates, or that have already been selected frequently,
are chosen with decreasing frequency.
Ultimately, this leads to the optimal action being selected almost
exclusively.