LRL: Learning Reinforcement Learning

Started on 2025.8.22, after an inspiring talk with Prof. Pan Ling.

0. Your goals.

  1. A: Ambitious.

  2. B: Big Vision.

  3. C: Communication.

  4. D: Define & Design.

  5. E: Execute.

1. Introduction.

  • Tic-Tac-Toe, with value iteration.

  • TODO: the algorithm currently fails, although it should work according to the book...

2. Multi-Armed Bandit (MAB) problems.

  • MAB with Action-Value Methods ($\epsilon\text{-Greedy}$).

    • Different $\epsilon$ values. (figure: MAB with Action-Value Methods)

    • Different step-size $\alpha$ strategies. (figure: MAB with Action-Value Methods)

    • MAB with Optimistic Initial Values. (figure: MAB with Optimistic Initial Values)

      • Note: pay extra attention to the spike at the beginning of the curve for the greedy method. After the first 10 steps (with optimistic initial values, every arm is tried once before any repeat), all Q-estimates sit around 4.5. The greedy method then keeps exploiting the arm whose sample gave the highest estimate, usually the truly best arm, which causes the spike. As that arm keeps being pulled, its Q-estimate drops below the still-optimistic estimates of the other arms, and performance drops again. (See the bandit sketch at the end of this section.)
  • MAB using UCB.

    • Different confidence level $c$ values. (figure: MAB using UCB)

      • Note: pay attention to the spikes in the curves. At the very beginning the model tries each arm once, so after the 10th step the UCB bonus term is equal for all arms. The model is then expected to choose the arm with the highest Q-estimate, which produces a spike. After this, if $c$ is large the UCB bonuses of the other arms quickly surpass that of the exploited arm, while if $c$ is small it takes many more pulls of the exploited arm before the others' UCB values overtake it.
  • MAB using Gradient Bandit Algorithms.

    • Different step-size $\alpha$ values, with and without a baseline. (figure: MAB using Gradient Bandit Algorithms)

      • Note: in this experiment, the mean value of each action is shifted to $4.0$ (not $0.0$).
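
The spike discussions above all come down to how the action-value estimates are maintained. Below is a minimal, self-contained sketch of the incremental update with $\epsilon$-greedy and UCB action selection on a 10-armed Gaussian testbed; it is an illustration only, not this repository's code, and the function name and interface are made up for this README.

```python
import numpy as np

def run_bandit(steps=1000, k=10, eps=0.1, c=None, q_init=0.0, alpha=None, seed=0):
    """One run on a k-armed Gaussian testbed: eps-greedy if c is None, else UCB."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, k)        # ground-truth action values
    Q = np.full(k, q_init, dtype=float)     # estimates (optimistic if q_init > 0)
    N = np.zeros(k, dtype=int)              # pull counts
    rewards = np.empty(steps)
    for t in range(steps):
        if c is not None:
            if np.any(N == 0):
                a = int(np.argmin(N))       # UCB: try every arm once first
            else:
                a = int(np.argmax(Q + c * np.sqrt(np.log(t + 1) / N)))
        elif rng.random() < eps:
            a = int(rng.integers(k))        # eps-greedy: explore
        else:
            a = int(np.argmax(Q))           # eps-greedy (or pure greedy): exploit
        r = rng.normal(q_true[a], 1.0)
        N[a] += 1
        step = alpha if alpha is not None else 1.0 / N[a]
        Q[a] += step * (r - Q[a])           # incremental update (1/N or constant alpha)
        rewards[t] = r
    return rewards
```

Under this sketch, `run_bandit(eps=0.0, q_init=5.0, alpha=0.1)` corresponds to the optimistic-greedy setting discussed above: after each arm has been pulled once its estimate is roughly $5 + 0.1\,(r - 5) \approx 4.5$, which is exactly the situation behind the early spike.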

3. Finite Markov Decision Processes with DP.

3.1 Iterative Policy Evaluation.

Using the Gridworld example from the book. (figure: Iterative Policy Evaluation)
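
A minimal sketch of an in-place evaluation sweep for the equiprobable random policy, together with the book's 4x4 gridworld as a usage example; this is an independent illustration (the `step` helper and the -1-per-step reward are assumptions, not this repository's implementation).

```python
def policy_evaluation(step, states, actions, terminals=(), gamma=1.0, theta=1e-4):
    """In-place iterative policy evaluation for the equiprobable random policy.

    `step(s, a) -> (next_state, reward)` is an assumed deterministic transition helper."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in terminals:
                continue                    # terminal values stay at 0
            v_old = V[s]
            # expected update: average the one-step backups over all actions
            V[s] = sum(r + gamma * V[s2]
                       for s2, r in (step(s, a) for a in actions)) / len(actions)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V

# Usage: the book's 4x4 gridworld, -1 per step, corners (0,0) and (3,3) terminal.
states = [(i, j) for i in range(4) for j in range(4)]
moves = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}

def step(s, a):
    ni, nj = s[0] + moves[a][0], s[1] + moves[a][1]
    s2 = (ni, nj) if 0 <= ni < 4 and 0 <= nj < 4 else s   # bumping a wall keeps you in place
    return s2, -1.0

V = policy_evaluation(step, states, list(moves), terminals={(0, 0), (3, 3)})
```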

3.2 Policy Iteration

Using the Jack's Car Rental example from the book.

  • Original setting. (figure: Policy Iteration: Optimal Policies)

  • With modification. (figure: Policy Iteration: Value Functions)
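
Policy iteration alternates evaluation and greedy improvement. Below is a generic skeleton of that loop; the helpers `evaluate(policy)` and `q(s, a, V)` stand in for the Jack's Car Rental model (Poisson request/return dynamics) and are purely hypothetical, not this repository's code.

```python
def policy_iteration(states, actions, q, evaluate):
    """Generic policy iteration.

    `q(s, a, V)` is assumed to return the one-step expected return of taking `a` in `s`
    under value function V; `evaluate(policy)` is assumed to return V for that policy."""
    policy = {s: actions[0] for s in states}
    while True:
        V = evaluate(policy)                          # policy evaluation
        stable = True
        for s in states:                              # greedy policy improvement
            best = max(actions, key=lambda a: q(s, a, V))
            if q(s, best, V) > q(s, policy[s], V) + 1e-9:
                policy[s] = best
                stable = False
        if stable:                                    # no state changed: optimal policy found
            return policy, V
```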

3.3 Value Iteration

Using the Gambler's Problem example from the book. (figures: Value Iteration: Optimal Policies; Value Iteration: Value Functions)
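
For the Gambler's Problem, value iteration is compact enough to sketch in full. The sketch below assumes the book's setup (goal of 100, heads probability $p_h = 0.4$, reward +1 only on reaching the goal) and is not this repository's script.

```python
import numpy as np

def gamblers_value_iteration(p_h=0.4, goal=100, theta=1e-9):
    """Value iteration for the Gambler's Problem; returns V and a greedy stake policy."""
    V = np.zeros(goal + 1)                    # V[0] = V[goal] = 0; reward comes from the transition

    def backups(s):
        stakes = list(range(1, min(s, goal - s) + 1))
        returns = [p_h * ((1.0 if s + a == goal else 0.0) + V[s + a])
                   + (1 - p_h) * V[s - a] for a in stakes]
        return stakes, returns

    while True:
        delta = 0.0
        for s in range(1, goal):
            _, returns = backups(s)
            best = max(returns)               # Bellman optimality update
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    policy = np.zeros(goal + 1, dtype=int)
    for s in range(1, goal):                  # greedy stake with respect to the final V
        stakes, returns = backups(s)
        policy[s] = stakes[int(np.argmax(returns))]
    return V, policy
```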

4. Finite Markov Decision Processes with MC.

4.1 Monte Carlo with Exploring Starts (MCES).

Using the Blackjack example from the book.

  1. With $\alpha = \frac{1}{N}$.

    (figure: MCES: Optimal Policies)

  2. With $\alpha = 0.005$.

    (figure: MCES: Value Functions)

* [TIPS]: The MCES algorithm leads to biased $Q$ estimates, because the returns are collected while the policy is still changing. This was not a problem in DP, where policy evaluation has access to the exact model of the environment.
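
A sketch of one MCES iteration showing both step-size rules used above; the environment interface (`reset_random` for the exploring start, `step`) and the greedy tie-breaking are assumptions, not this repository's Blackjack code.

```python
def mces_episode(env, Q, N, actions, alpha=None, gamma=1.0):
    """One Monte Carlo ES update. Q and N are plain dicts keyed by (state, action);
    `env.reset_random() -> (state, action)` gives the exploring start and
    `env.step(a) -> (state, reward, done)` is an assumed interface."""
    s, a = env.reset_random()                          # exploring start
    episode, done = [], False
    while not done:
        s2, r, done = env.step(a)
        episode.append((s, a, r))
        s = s2
        if not done:
            a = max(actions, key=lambda x: Q.get((s, x), 0.0))   # follow the greedy policy
    G, first_return = 0.0, {}
    for s, a, r in reversed(episode):                  # accumulate returns backwards
        G = gamma * G + r
        first_return[(s, a)] = G                       # overwritten until only the first visit remains
    for sa, G in first_return.items():
        N[sa] = N.get(sa, 0) + 1
        step = alpha if alpha is not None else 1.0 / N[sa]   # constant alpha or sample average
        q = Q.get(sa, 0.0)
        Q[sa] = q + step * (G - q)
```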

4.2 Off-Policy Monte Carlo Control

Using the Racetrack example from the book.

Several problems encountered...

  1. I tried $\epsilon=0.5$ for the behavior policy, but then training concentrated only on the last few states of each episode, because the condition $\text{Action}_{\text{target}}=\text{Action}_{\text{behavior}}$ was rarely satisfied for long. So I set $\epsilon=0.1$.

  2. I tried to use Every-Visit MC, but this led to quite unbalanced learning: within the episodes there are far more visits to the $(s,a)$ pairs the agent has already learned well (since the target policy is greedy and the behavior policy is $\epsilon$-greedy), and these samples dominate the updates.

    • Equivalently, the Every-Visit estimator is biased here.

    • Consequently, after about 1000 episodes, the agent almost always chose the same trajectories.

    (figure: Off-Policy MC Control: Every-Visit MC)

  3. Finally, I used First-Visit MC, which solved the above problem (a sketch of the importance-sampling update follows the figures below).

    (figure: Off-Policy MC Control: Training Curve)

    (figure: Off-Policy MC Control, Step=0)

    (figure: Off-Policy MC Control, Step=4000)

    (figure: Off-Policy MC Control, Step=9000)
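
For reference, here is a sketch of the incremental weighted-importance-sampling backward pass from the book that the points above refer to. The environment interface is assumed, and the first-visit filtering described in item 3 would be layered on top (it is omitted here to keep the sketch short).

```python
import random

def off_policy_mc_episode(env, Q, C, n_actions, eps=0.1, gamma=1.0):
    """One incremental weighted-importance-sampling update. Q and C are plain dicts
    keyed by (state, action); `env.reset()` and `env.step(a)` are assumed interfaces."""
    def greedy(s):
        return max(range(n_actions), key=lambda x: Q.get((s, x), 0.0))

    # generate one episode with the eps-greedy behavior policy
    episode, s, done = [], env.reset(), False
    while not done:
        a = random.randrange(n_actions) if random.random() < eps else greedy(s)
        s2, r, done = env.step(a)
        episode.append((s, a, r))
        s = s2

    # process it backwards with weighted importance sampling
    G, W = 0.0, 1.0
    for s, a, r in reversed(episode):
        G = gamma * G + r
        C[(s, a)] = C.get((s, a), 0.0) + W
        Q[(s, a)] = Q.get((s, a), 0.0) + (W / C[(s, a)]) * (G - Q.get((s, a), 0.0))
        if a != greedy(s):
            break                                   # the greedy target policy would never take a
        W /= (1.0 - eps) + eps / n_actions          # divide by the behavior probability of a
```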

5. Finite Markov Decision Processes with TD.

Using the Cliff Walking example from the book.

(figure: Cliff Walking: Training Curve)

(figure: Cliff Walking: Optimal Policies)

  • Why does Q-Learning tend to stay near the cliff while SARSA stays away from it?

    • Q-Learning learns the optimal (greedy) policy directly, without considering that the policy it actually follows during training is $\epsilon$-greedy.

    • While training, however, the agent takes $\epsilon$-greedy actions, so following the cliff-hugging optimal path occasionally sends it off the cliff. This is why the online training curve of Q-Learning is lower than that of SARSA.
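
The difference comes down to one line in the update target. A minimal sketch of both tabular methods follows; the `env` interface is assumed, not this repository's Cliff Walking code.

```python
import random
from collections import defaultdict

def td_control(env, n_actions, method="q_learning", episodes=500,
               alpha=0.5, gamma=1.0, eps=0.1):
    """Tabular Q-Learning / SARSA; `env.reset()` and `env.step(a) -> (s, r, done)` are assumed."""
    Q = defaultdict(float)

    def act(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = act(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = act(s2)
            if done:
                target = r                      # no bootstrap from terminal states
            elif method == "q_learning":
                target = r + gamma * max(Q[(s2, b)] for b in range(n_actions))
            else:                               # SARSA bootstraps on the action actually taken
                target = r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```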

6. Finite Markov Decision Processes with n-Step Bootstrapping.

Using the Windy Gridworld example from the book.

(figure: Windy Gridworld: Training Curve)

(figure: Windy Gridworld: Optimal Policies)

DEBUG INFO: when using n-step methods, never forget to keep updating after the episode terminates, i.e. to fully train the agent on the remaining update times $\tau = T - n, T - n + 1, \ldots, T - 1$ (see the sketch below).
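
A sketch of one n-step SARSA episode that makes this tail explicit: the loop keeps running after the terminal state is reached until $\tau = T - 1$ has been updated. The `env` interface and the policy helper `act(s, Q)` are assumptions, not this repository's code.

```python
def n_step_sarsa_episode(env, Q, act, n=4, alpha=0.5, gamma=1.0):
    """One episode of n-step SARSA. Q is a plain dict keyed by (state, action);
    `env.reset()`, `env.step(a) -> (s, r, done)` and the policy `act(s, Q)` are assumed."""
    def q(sa):
        return Q.get(sa, 0.0)

    states = [env.reset()]
    actions = [act(states[0], Q)]
    rewards = [0.0]                          # dummy entry so that rewards[t] == R_t
    T, t = float("inf"), 0
    while True:
        if t < T:
            s2, r, done = env.step(actions[t])
            states.append(s2)
            rewards.append(r)
            if done:
                T = t + 1
            else:
                actions.append(act(s2, Q))
        tau = t - n + 1                      # time whose (state, action) estimate is updated
        if tau >= 0:
            G = sum(gamma ** (i - tau - 1) * rewards[i]
                    for i in range(tau + 1, min(tau + n, T) + 1))
            if tau + n < T:                  # bootstrap only if the horizon is before termination
                G += gamma ** n * q((states[tau + n], actions[tau + n]))
            sa = (states[tau], actions[tau])
            Q[sa] = q(sa) + alpha * (G - q(sa))
        if tau == T - 1:                     # the tail tau = T-n, ..., T-1 is NOT skipped
            break
        t += 1
    return Q
```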

7. Finite Markov Decision Processes with Planning and Learning.

7.1 Dyna-Q, Dyna-Q+ and Prioritized Sweeping.

Using three Maze examples from the book.

(figures: training curves and mazes for the Block Path, Deterministic Maze, and Shortcut tasks)

  1. Block the path after 50 episodes. (figure: Block Trajectories)

  2. Deterministic Maze. (figure: Deterministic Trajectories)

  3. Give a shortcut after 50 episodes. (figure: Shortcut Trajectories)
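
A sketch of the core Dyna-Q step (direct RL update, model learning, then `n_planning` simulated updates). Dyna-Q+ would additionally add a $\kappa\sqrt{\tau}$ exploration bonus to the planned rewards, and prioritized sweeping would replace the random replay with a priority queue. The interfaces here are assumptions, not this repository's maze code.

```python
import random

def dyna_q_step(env, s, Q, model, n_actions, n_planning=10,
                alpha=0.1, gamma=0.95, eps=0.1):
    """One real Dyna-Q step followed by n_planning simulated updates.
    Q and model are plain dicts; `env.step(a) -> (state, reward, done)` is assumed."""
    def q(state, action):
        return Q.get((state, action), 0.0)

    # (a) act eps-greedily in the real environment
    if random.random() < eps:
        a = random.randrange(n_actions)
    else:
        a = max(range(n_actions), key=lambda x: q(s, x))
    s2, r, done = env.step(a)
    # (b) direct RL: one Q-learning update from real experience
    target = r if done else r + gamma * max(q(s2, b) for b in range(n_actions))
    Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
    # (c) model learning: remember the (assumed deterministic) transition
    model[(s, a)] = (s2, r, done)
    # (d) planning: replay randomly chosen remembered transitions
    for _ in range(n_planning):
        (ps, pa), (ps2, pr, pdone) = random.choice(list(model.items()))
        t = pr if pdone else pr + gamma * max(q(ps2, b) for b in range(n_actions))
        Q[(ps, pa)] = q(ps, pa) + alpha * (t - q(ps, pa))
    return s2, done
```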

7.2 Trajectory Sampling for Value Iteration.

Using the Racetrack example from the book.

(figure: Update Heatmaps)
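
As a rough illustration of what trajectory sampling means here: instead of sweeping all states uniformly, expected updates are applied to the state-action pairs visited along simulated on-policy trajectories. The `model(s, a)` simulator interface below is an assumption, not this repository's Racetrack code.

```python
import random

def trajectory_sampling_vi(model, start_state, actions, Q,
                           n_updates=10_000, gamma=1.0, eps=0.1):
    """Expected one-step updates along simulated on-policy trajectories.
    `model(s, a)` is assumed to return a list of (prob, next_state, reward, done)
    outcomes; Q is a plain dict keyed by (state, action)."""
    def q(state, action):
        return Q.get((state, action), 0.0)

    s = start_state
    for _ in range(n_updates):
        # eps-greedy behavior along the simulated trajectory
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: q(s, x))
        outcomes = model(s, a)
        # full expected backup of Q(s, a) under the model
        Q[(s, a)] = sum(p * (r + (0.0 if done else gamma * max(q(s2, b) for b in actions)))
                        for p, s2, r, done in outcomes)
        # advance the trajectory by sampling one successor from the model
        _, s2, _, done = random.choices(outcomes, weights=[o[0] for o in outcomes])[0]
        s = start_state if done else s2
    return Q
```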

8. Approximate Online Prediction.

Using the Racetrack example from the book.

  1. Different function approximation methods. (figure: Different function approximation methods)

  2. Visualization of the learned value functions. (figure: Visualization of the learned value functions)

[TIPS]: The policy evaluated here is the trained $\epsilon\text{-Greedy}$ policy with $\epsilon=0.1$ from RaceTrack_Policy.py. The $\texttt{LSTD}$ method failed to predict the values, probably because of poor feature design, while the Artificial Neural Network method outperforms the others.
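
A representative method for this kind of online prediction, semi-gradient TD(0) with a linear value function, can be sketched as follows. The `run_episode` generator (experience from the fixed policy being evaluated) and the `phi` feature map are assumed interfaces, not this repository's code.

```python
import numpy as np

def semi_gradient_td0(run_episode, phi, n_features, episodes=200,
                      alpha=0.01, gamma=1.0, seed=0):
    """Semi-gradient TD(0) prediction with a linear value function v(s) = w . phi(s).
    `run_episode()` is assumed to yield (s, r, s_next, done) transitions generated by
    the fixed policy being evaluated; `phi(s)` maps a state to a feature vector."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 0.01, n_features)
    for _ in range(episodes):
        for s, r, s2, done in run_episode():
            v = phi(s) @ w
            v_next = 0.0 if done else phi(s2) @ w
            w += alpha * (r + gamma * v_next - v) * phi(s)   # no gradient flows through v_next
    return w
```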

9. Approximate Online Control.

Using the Access Control Queuing example from the book.

(figure: Access Control Queuing: Training Curves)

(figure: Access Control Queuing: Optimal Policies)
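
The book treats Access Control Queuing as a continuing task and solves it with differential semi-gradient SARSA in the average-reward setting. A minimal tabular sketch follows; the `env` interface is an assumption, not this repository's code.

```python
import random
from collections import defaultdict

def differential_sarsa(env, n_actions, steps=100_000, alpha=0.01, beta=0.01, eps=0.1):
    """Tabular differential semi-gradient SARSA for a continuing (average-reward) task.
    `env.reset() -> state` and `env.step(a) -> (state, reward)` are assumed interfaces."""
    Q = defaultdict(float)
    avg_r = 0.0                                       # average-reward estimate R-bar

    def act(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    s = env.reset()
    a = act(s)
    for _ in range(steps):
        s2, r = env.step(a)                           # continuing task: no terminal states
        a2 = act(s2)
        delta = r - avg_r + Q[(s2, a2)] - Q[(s, a)]   # differential TD error
        avg_r += beta * delta                         # update the average-reward estimate
        Q[(s, a)] += alpha * delta
        s, a = s2, a2
    return Q, avg_r
```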
