LRL: Learning Reinforcement Learning

Started on 2025.8.22, after an inspiring talk with Prof. Pan Ling.

0. Your goals.

  1. A: Ambitious.

  2. B: Big Vision.

  3. C: Communication.

  4. D: Define & Design.

  5. E: Execute.

1. Introduction.

  • Tic-Tac-Toe, with value iteration.

  • TODO: the algorithm currently fails, although it should work according to the book...

2. Multi-Armed Bandit (MAB) problems.

  • MAB with Action-Value Methods ($\epsilon\text{-Greedy}$).

    • Different $\epsilon$ values. (figure: MAB with Action-Value Methods)

    • Different step-size $\alpha$ strategies. (figure: MAB with Action-Value Methods)

    • MAB with Optimistic Initial Values. (figure: MAB with Optimistic Initial Values)

      • Note: pay extra attention to the spike at the beginning of the curve for the greedy method. After the first 10 steps (with optimistic initial values, every arm is tried once before any repeat), all Q-estimates sit around 4.5. The greedy method then keeps exploiting the arm whose sample gave the highest estimate, usually the truly best arm, which causes the spike. As that arm keeps being pulled, its Q-estimate drops below the still-optimistic estimates of the other arms, and performance drops again. (See the bandit sketch at the end of this section.)
  • MAB using UCB.

    • Different confidence level $c$ values. (figure: MAB using UCB)

      • Note: pay attention to the spikes in the curves. At the very beginning the model tries each arm once, so after the 10th step the UCB bonus term is equal for all arms. The model is then expected to choose the arm with the highest Q-estimate, which produces a spike. After this, if $c$ is large the UCB bonuses of the other arms quickly surpass that of the exploited arm, while if $c$ is small it takes many more pulls of the exploited arm before the others' UCB values overtake it.
  • MAB using Gradient Bandit Algorithms.

    • Different step-size $\alpha$ values, with and without a baseline. (figure: MAB using Gradient Bandit Algorithms)

      • Note: in this experiment, the mean value of each action is shifted to $4.0$ (not $0.0$).
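
The spike discussions above all come down to how the action-value estimates are maintained. Below is a minimal, self-contained sketch of the incremental update with $\epsilon$-greedy and UCB action selection on a 10-armed Gaussian testbed; it is an illustration only, not this repository's code, and the function name and interface are made up for this README.

```python
import numpy as np

def run_bandit(steps=1000, k=10, eps=0.1, c=None, q_init=0.0, alpha=None, seed=0):
    """One run on a k-armed Gaussian testbed: eps-greedy if c is None, else UCB."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, k)        # ground-truth action values
    Q = np.full(k, q_init, dtype=float)     # estimates (optimistic if q_init > 0)
    N = np.zeros(k, dtype=int)              # pull counts
    rewards = np.empty(steps)
    for t in range(steps):
        if c is not None:
            if np.any(N == 0):
                a = int(np.argmin(N))       # UCB: try every arm once first
            else:
                a = int(np.argmax(Q + c * np.sqrt(np.log(t + 1) / N)))
        elif rng.random() < eps:
            a = int(rng.integers(k))        # eps-greedy: explore
        else:
            a = int(np.argmax(Q))           # eps-greedy (or pure greedy): exploit
        r = rng.normal(q_true[a], 1.0)
        N[a] += 1
        step = alpha if alpha is not None else 1.0 / N[a]
        Q[a] += step * (r - Q[a])           # incremental update (1/N or constant alpha)
        rewards[t] = r
    return rewards
```

Under this sketch, `run_bandit(eps=0.0, q_init=5.0, alpha=0.1)` corresponds to the optimistic-greedy setting discussed above: after each arm has been pulled once its estimate is roughly $5 + 0.1\,(r - 5) \approx 4.5$, which is exactly the situation behind the early spike.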

3. Finite Markov Decision Processes with DP.

3.1 Iterative Policy Evaluation.

Using the Gridworld example from the book. (figure: Iterative Policy Evaluation)
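
A minimal sketch of an in-place evaluation sweep for the equiprobable random policy, together with the book's 4x4 gridworld as a usage example; this is an independent illustration (the `step` helper and the -1-per-step reward are assumptions, not this repository's implementation).

```python
def policy_evaluation(step, states, actions, terminals=(), gamma=1.0, theta=1e-4):
    """In-place iterative policy evaluation for the equiprobable random policy.

    `step(s, a) -> (next_state, reward)` is an assumed deterministic transition helper."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in terminals:
                continue                    # terminal values stay at 0
            v_old = V[s]
            # expected update: average the one-step backups over all actions
            V[s] = sum(r + gamma * V[s2]
                       for s2, r in (step(s, a) for a in actions)) / len(actions)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V

# Usage: the book's 4x4 gridworld, -1 per step, corners (0,0) and (3,3) terminal.
states = [(i, j) for i in range(4) for j in range(4)]
moves = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}

def step(s, a):
    ni, nj = s[0] + moves[a][0], s[1] + moves[a][1]
    s2 = (ni, nj) if 0 <= ni < 4 and 0 <= nj < 4 else s   # bumping a wall keeps you in place
    return s2, -1.0

V = policy_evaluation(step, states, list(moves), terminals={(0, 0), (3, 3)})
```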

3.2 Policy Iteration

Using the Jack's Car Rental example from the book.

  • Original setting. (figure: Policy Iteration: Optimal Policies)

  • With modification. (figure: Policy Iteration: Value Functions)
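
Policy iteration alternates evaluation and greedy improvement. Below is a generic skeleton of that loop; the helpers `evaluate(policy)` and `q(s, a, V)` stand in for the Jack's Car Rental model (Poisson request/return dynamics) and are purely hypothetical, not this repository's code.

```python
def policy_iteration(states, actions, q, evaluate):
    """Generic policy iteration.

    `q(s, a, V)` is assumed to return the one-step expected return of taking `a` in `s`
    under value function V; `evaluate(policy)` is assumed to return V for that policy."""
    policy = {s: actions[0] for s in states}
    while True:
        V = evaluate(policy)                          # policy evaluation
        stable = True
        for s in states:                              # greedy policy improvement
            best = max(actions, key=lambda a: q(s, a, V))
            if q(s, best, V) > q(s, policy[s], V) + 1e-9:
                policy[s] = best
                stable = False
        if stable:                                    # no state changed: optimal policy found
            return policy, V
```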

3.3 Value Iteration

Using the Gambler's Problem example from the book. (figures: Value Iteration: Optimal Policies; Value Iteration: Value Functions)
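
For the Gambler's Problem, value iteration is compact enough to sketch in full. The sketch below assumes the book's setup (goal of 100, heads probability $p_h = 0.4$, reward +1 only on reaching the goal) and is not this repository's script.

```python
import numpy as np

def gamblers_value_iteration(p_h=0.4, goal=100, theta=1e-9):
    """Value iteration for the Gambler's Problem; returns V and a greedy stake policy."""
    V = np.zeros(goal + 1)                    # V[0] = V[goal] = 0; reward comes from the transition

    def backups(s):
        stakes = list(range(1, min(s, goal - s) + 1))
        returns = [p_h * ((1.0 if s + a == goal else 0.0) + V[s + a])
                   + (1 - p_h) * V[s - a] for a in stakes]
        return stakes, returns

    while True:
        delta = 0.0
        for s in range(1, goal):
            _, returns = backups(s)
            best = max(returns)               # Bellman optimality update
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    policy = np.zeros(goal + 1, dtype=int)
    for s in range(1, goal):                  # greedy stake with respect to the final V
        stakes, returns = backups(s)
        policy[s] = stakes[int(np.argmax(returns))]
    return V, policy
```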

4. Finite Markov Decision Processes with MC.

4.1 Monte Carlo with Exploring Starts (MCES).

Using the Blackjack example from the book.

  1. With $\alpha = \frac{1}{N}$.

    (figure: MCES: Optimal Policies)

  2. With $\alpha = 0.005$.

    (figure: MCES: Value Functions)

* [TIPS]: The MCES algorithm leads to biased $Q$ estimates, because the returns are collected while the policy is still changing. This was not a problem in DP, where policy evaluation has access to the exact model of the environment.
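
A sketch of one MCES iteration showing both step-size rules used above; the environment interface (`reset_random` for the exploring start, `step`) and the greedy tie-breaking are assumptions, not this repository's Blackjack code.

```python
def mces_episode(env, Q, N, actions, alpha=None, gamma=1.0):
    """One Monte Carlo ES update. Q and N are plain dicts keyed by (state, action);
    `env.reset_random() -> (state, action)` gives the exploring start and
    `env.step(a) -> (state, reward, done)` is an assumed interface."""
    s, a = env.reset_random()                          # exploring start
    episode, done = [], False
    while not done:
        s2, r, done = env.step(a)
        episode.append((s, a, r))
        s = s2
        if not done:
            a = max(actions, key=lambda x: Q.get((s, x), 0.0))   # follow the greedy policy
    G, first_return = 0.0, {}
    for s, a, r in reversed(episode):                  # accumulate returns backwards
        G = gamma * G + r
        first_return[(s, a)] = G                       # overwritten until only the first visit remains
    for sa, G in first_return.items():
        N[sa] = N.get(sa, 0) + 1
        step = alpha if alpha is not None else 1.0 / N[sa]   # constant alpha or sample average
        q = Q.get(sa, 0.0)
        Q[sa] = q + step * (G - q)
```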

4.2 Off-Policy Monte Carlo Control

Using the Racetrack example from the book.

Several problems encountered...

  1. I tried $\epsilon=0.5$ for the behavior policy, but then training concentrated only on the last few states of each episode, because the condition $\text{Action}_{\text{target}}=\text{Action}_{\text{behavior}}$ was rarely satisfied for long. So I set $\epsilon=0.1$.

  2. I tried to use Every-Visit MC, but this led to quite unbalanced learning: within the episodes there are far more visits to the $(s,a)$ pairs the agent has already learned well (since the target policy is greedy and the behavior policy is $\epsilon$-greedy), and these samples dominate the updates.

    • Equivalently, the Every-Visit estimator is biased here.

    • Consequently, after about 1000 episodes, the agent almost always chose the same trajectories.

    (figure: Off-Policy MC Control: Every-Visit MC)

  3. Finally, I used First-Visit MC, which solved the above problem (a sketch of the importance-sampling update follows the figures below).

    (figure: Off-Policy MC Control: Training Curve)

    (figure: Off-Policy MC Control, Step=0)

    (figure: Off-Policy MC Control, Step=4000)

    (figure: Off-Policy MC Control, Step=9000)
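
For reference, here is a sketch of the incremental weighted-importance-sampling backward pass from the book that the points above refer to. The environment interface is assumed, and the first-visit filtering described in item 3 would be layered on top (it is omitted here to keep the sketch short).

```python
import random

def off_policy_mc_episode(env, Q, C, n_actions, eps=0.1, gamma=1.0):
    """One incremental weighted-importance-sampling update. Q and C are plain dicts
    keyed by (state, action); `env.reset()` and `env.step(a)` are assumed interfaces."""
    def greedy(s):
        return max(range(n_actions), key=lambda x: Q.get((s, x), 0.0))

    # generate one episode with the eps-greedy behavior policy
    episode, s, done = [], env.reset(), False
    while not done:
        a = random.randrange(n_actions) if random.random() < eps else greedy(s)
        s2, r, done = env.step(a)
        episode.append((s, a, r))
        s = s2

    # process it backwards with weighted importance sampling
    G, W = 0.0, 1.0
    for s, a, r in reversed(episode):
        G = gamma * G + r
        C[(s, a)] = C.get((s, a), 0.0) + W
        Q[(s, a)] = Q.get((s, a), 0.0) + (W / C[(s, a)]) * (G - Q.get((s, a), 0.0))
        if a != greedy(s):
            break                                   # the greedy target policy would never take a
        W /= (1.0 - eps) + eps / n_actions          # divide by the behavior probability of a
```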

5. Finite Markov Decision Processes with TD.

Using the Cliff Walking example from the book.

(figure: Cliff Walking: Training Curve)

(figure: Cliff Walking: Optimal Policies)

  • Why does Q-Learning tend to stay near the cliff while SARSA stays away from it?

    • Q-Learning learns the optimal (greedy) policy directly, without considering that the policy it actually follows during training is $\epsilon$-greedy.

    • While training, however, the agent takes $\epsilon$-greedy actions, so following the cliff-hugging optimal path occasionally sends it off the cliff. This is why the online training curve of Q-Learning is lower than that of SARSA.
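
The difference comes down to one line in the update target. A minimal sketch of both tabular methods follows; the `env` interface is assumed, not this repository's Cliff Walking code.

```python
import random
from collections import defaultdict

def td_control(env, n_actions, method="q_learning", episodes=500,
               alpha=0.5, gamma=1.0, eps=0.1):
    """Tabular Q-Learning / SARSA; `env.reset()` and `env.step(a) -> (s, r, done)` are assumed."""
    Q = defaultdict(float)

    def act(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = act(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = act(s2)
            if done:
                target = r                      # no bootstrap from terminal states
            elif method == "q_learning":
                target = r + gamma * max(Q[(s2, b)] for b in range(n_actions))
            else:                               # SARSA bootstraps on the action actually taken
                target = r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```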

6. Finite Markov Decision Processes with n-Step Bootstrapping.

Using the Windy Gridworld example from the book.

(figure: Windy Gridworld: Training Curve)

(figure: Windy Gridworld: Optimal Policies)

DEBUG INFO: when using n-step methods, never forget to keep updating after the episode terminates, i.e. to fully train the agent on the remaining update times $\tau = T - n, T - n + 1, \ldots, T - 1$ (see the sketch below).
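
A sketch of one n-step SARSA episode that makes this tail explicit: the loop keeps running after the terminal state is reached until $\tau = T - 1$ has been updated. The `env` interface and the policy helper `act(s, Q)` are assumptions, not this repository's code.

```python
def n_step_sarsa_episode(env, Q, act, n=4, alpha=0.5, gamma=1.0):
    """One episode of n-step SARSA. Q is a plain dict keyed by (state, action);
    `env.reset()`, `env.step(a) -> (s, r, done)` and the policy `act(s, Q)` are assumed."""
    def q(sa):
        return Q.get(sa, 0.0)

    states = [env.reset()]
    actions = [act(states[0], Q)]
    rewards = [0.0]                          # dummy entry so that rewards[t] == R_t
    T, t = float("inf"), 0
    while True:
        if t < T:
            s2, r, done = env.step(actions[t])
            states.append(s2)
            rewards.append(r)
            if done:
                T = t + 1
            else:
                actions.append(act(s2, Q))
        tau = t - n + 1                      # time whose (state, action) estimate is updated
        if tau >= 0:
            G = sum(gamma ** (i - tau - 1) * rewards[i]
                    for i in range(tau + 1, min(tau + n, T) + 1))
            if tau + n < T:                  # bootstrap only if the horizon is before termination
                G += gamma ** n * q((states[tau + n], actions[tau + n]))
            sa = (states[tau], actions[tau])
            Q[sa] = q(sa) + alpha * (G - q(sa))
        if tau == T - 1:                     # the tail tau = T-n, ..., T-1 is NOT skipped
            break
        t += 1
    return Q
```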

7. Finite Markov Decision Processes with Planning and Learning.

7.1 Dyna-Q, Dyna-Q+ and Prioritized Sweeping.

Using three Maze examples from the book.

(figures: training curves and mazes for the Block Path, Deterministic Maze, and Shortcut tasks)

  1. Block the path after 50 episodes. (figure: Block Trajectories)

  2. Deterministic Maze. (figure: Deterministic Trajectories)

  3. Give a shortcut after 50 episodes. (figure: Shortcut Trajectories)
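
A sketch of the core Dyna-Q step (direct RL update, model learning, then `n_planning` simulated updates). Dyna-Q+ would additionally add a $\kappa\sqrt{\tau}$ exploration bonus to the planned rewards, and prioritized sweeping would replace the random replay with a priority queue. The interfaces here are assumptions, not this repository's maze code.

```python
import random

def dyna_q_step(env, s, Q, model, n_actions, n_planning=10,
                alpha=0.1, gamma=0.95, eps=0.1):
    """One real Dyna-Q step followed by n_planning simulated updates.
    Q and model are plain dicts; `env.step(a) -> (state, reward, done)` is assumed."""
    def q(state, action):
        return Q.get((state, action), 0.0)

    # (a) act eps-greedily in the real environment
    if random.random() < eps:
        a = random.randrange(n_actions)
    else:
        a = max(range(n_actions), key=lambda x: q(s, x))
    s2, r, done = env.step(a)
    # (b) direct RL: one Q-learning update from real experience
    target = r if done else r + gamma * max(q(s2, b) for b in range(n_actions))
    Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
    # (c) model learning: remember the (assumed deterministic) transition
    model[(s, a)] = (s2, r, done)
    # (d) planning: replay randomly chosen remembered transitions
    for _ in range(n_planning):
        (ps, pa), (ps2, pr, pdone) = random.choice(list(model.items()))
        t = pr if pdone else pr + gamma * max(q(ps2, b) for b in range(n_actions))
        Q[(ps, pa)] = q(ps, pa) + alpha * (t - q(ps, pa))
    return s2, done
```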

7.2 Trajectory Sampling for Value Iteration.

Using the Racetrack example from the book.

(figure: Update Heatmaps)
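
As a rough illustration of what trajectory sampling means here: instead of sweeping all states uniformly, expected updates are applied to the state-action pairs visited along simulated on-policy trajectories. The `model(s, a)` simulator interface below is an assumption, not this repository's Racetrack code.

```python
import random

def trajectory_sampling_vi(model, start_state, actions, Q,
                           n_updates=10_000, gamma=1.0, eps=0.1):
    """Expected one-step updates along simulated on-policy trajectories.
    `model(s, a)` is assumed to return a list of (prob, next_state, reward, done)
    outcomes; Q is a plain dict keyed by (state, action)."""
    def q(state, action):
        return Q.get((state, action), 0.0)

    s = start_state
    for _ in range(n_updates):
        # eps-greedy behavior along the simulated trajectory
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: q(s, x))
        outcomes = model(s, a)
        # full expected backup of Q(s, a) under the model
        Q[(s, a)] = sum(p * (r + (0.0 if done else gamma * max(q(s2, b) for b in actions)))
                        for p, s2, r, done in outcomes)
        # advance the trajectory by sampling one successor from the model
        _, s2, _, done = random.choices(outcomes, weights=[o[0] for o in outcomes])[0]
        s = start_state if done else s2
    return Q
```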

8. Approximate Online Prediction.

Using the Racetrack example from the book.

  1. Different function approximation methods. (figure: Different function approximation methods)

  2. Visualization of the learned value functions. (figure: Visualization of the learned value functions)

[TIPS]: The policy evaluated here is the trained $\epsilon\text{-Greedy}$ policy with $\epsilon=0.1$ from RaceTrack_Policy.py. The $\texttt{LSTD}$ method failed to predict the values, probably because of poor feature design, while the Artificial Neural Network method outperforms the others.
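
A representative method for this kind of online prediction, semi-gradient TD(0) with a linear value function, can be sketched as follows. The `run_episode` generator (experience from the fixed policy being evaluated) and the `phi` feature map are assumed interfaces, not this repository's code.

```python
import numpy as np

def semi_gradient_td0(run_episode, phi, n_features, episodes=200,
                      alpha=0.01, gamma=1.0, seed=0):
    """Semi-gradient TD(0) prediction with a linear value function v(s) = w . phi(s).
    `run_episode()` is assumed to yield (s, r, s_next, done) transitions generated by
    the fixed policy being evaluated; `phi(s)` maps a state to a feature vector."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 0.01, n_features)
    for _ in range(episodes):
        for s, r, s2, done in run_episode():
            v = phi(s) @ w
            v_next = 0.0 if done else phi(s2) @ w
            w += alpha * (r + gamma * v_next - v) * phi(s)   # no gradient flows through v_next
    return w
```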

9. Approximate Online Control.

Using the Access Control Queuing example from the book.

(figure: Access Control Queuing: Training Curves)

(figure: Access Control Queuing: Optimal Policies)
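
The book treats Access Control Queuing as a continuing task and solves it with differential semi-gradient SARSA in the average-reward setting. A minimal tabular sketch follows; the `env` interface is an assumption, not this repository's code.

```python
import random
from collections import defaultdict

def differential_sarsa(env, n_actions, steps=100_000, alpha=0.01, beta=0.01, eps=0.1):
    """Tabular differential semi-gradient SARSA for a continuing (average-reward) task.
    `env.reset() -> state` and `env.step(a) -> (state, reward)` are assumed interfaces."""
    Q = defaultdict(float)
    avg_r = 0.0                                       # average-reward estimate R-bar

    def act(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    s = env.reset()
    a = act(s)
    for _ in range(steps):
        s2, r = env.step(a)                           # continuing task: no terminal states
        a2 = act(s2)
        delta = r - avg_r + Q[(s2, a2)] - Q[(s, a)]   # differential TD error
        avg_r += beta * delta                         # update the average-reward estimate
        Q[(s, a)] += alpha * delta
        s, a = s2, a2
    return Q, avg_r
```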
