Since 2025.8.22, after an impressive talk with Prof. Pan Ling:

- A: Ambitious.
- B: Big Vision.
- C: Communication.
- D: Define & Design.
- E: Execute.
Tic-Tac-Toe, with value iteration.
- TODO: The algorithm fails, although it should work according to the book...
MAB with Action-Value Methods ($\epsilon$-Greedy).
MAB with Optimistic Initial Values.
- Note: pay extra attention to the spike at the beginning of the curve for the greedy method. The reason is that after 10 steps of exploration (the first steps are always exploratory because of the optimistic initial values), all Q-values are around 4.5 (every arm has been tried once). The greedy method then keeps exploiting the arm with the highest ground-truth value, causing the spike. After this exploitation, the Q-value of that arm drops below the others, leading to a drop in performance.
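A minimal sketch of this testbed (10 arms, Gaussian rewards); `run_bandit` and its parameters are illustrative names of mine, not the repo's API:

```python
import numpy as np

def run_bandit(q_true, steps=1000, epsilon=0.1, initial_q=0.0, alpha=None, seed=None):
    """Epsilon-greedy action-value method on a stationary k-armed bandit.
    q_true holds the ground-truth mean reward of each arm; alpha=None uses
    sample-average updates, otherwise a constant step size."""
    rng = np.random.default_rng(seed)
    k = len(q_true)
    Q = np.full(k, initial_q, dtype=float)    # value estimates
    N = np.zeros(k)                           # pull counts
    rewards = np.zeros(steps)
    for t in range(steps):
        a = rng.integers(k) if rng.random() < epsilon else int(np.argmax(Q))
        r = rng.normal(q_true[a], 1.0)        # reward ~ N(q_true[a], 1)
        N[a] += 1
        step = (1.0 / N[a]) if alpha is None else alpha
        Q[a] += step * (r - Q[a])             # incremental update
        rewards[t] = r
    return rewards

# Optimistic initial values (the spike discussed above): a greedy agent with
# Q initialised optimistically and a constant step size, e.g.
# run_bandit(np.random.randn(10), epsilon=0.0, initial_q=5.0, alpha=0.1)
```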
MAB using UCB.
- Different confidence level $c$ values.
- Note: pay attention to the spikes in the curves. At the very beginning the model tries each arm once, so at the 10th step the UCB term is equal for all arms. The model is then expected to choose the arm with the highest Q-value, leading to a spike. After this exploitation, if $c$ is large, the UCB term of the other arms will quickly surpass that of the exploited arm; if $c$ is small, it takes more exploitation steps for the UCB term of the other arms to surpass it.
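A hedged sketch of the UCB selection rule discussed above (function name and signature are mine):

```python
import numpy as np

def ucb_action(Q, N, t, c):
    """Upper-Confidence-Bound selection: every arm is tried once first,
    then pick the argmax of Q[a] + c * sqrt(ln(t) / N[a])."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])        # forces the initial "try each arm once" phase
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

# A larger c makes the bonus of rarely pulled arms grow faster, which is why
# the post-spike drop differs across the c values compared above.
```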
MAB using Gradient Bandit Algorithms.
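A minimal sketch of one preference update of the gradient bandit algorithm, with a softmax policy and an average-reward baseline (all names are mine):

```python
import numpy as np

def gradient_bandit_update(H, a, r, baseline, alpha):
    """One step of the gradient bandit algorithm: H are the action
    preferences, baseline is the running average reward."""
    pi = np.exp(H - H.max())
    pi /= pi.sum()                    # softmax policy over arms
    grad = -pi
    grad[a] += 1.0                    # (1{A_t = a} - pi_a)
    return H + alpha * (r - baseline) * grad
```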
Using Gridworld example from the book.
Using Jack's Car Rental example from the book.
Using Gambler's Problem example from the book.
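For reference, a sketch of value iteration for the Gambler's Problem as stated in the book ($p_h$ is the probability of heads); the function name and defaults are my choices:

```python
import numpy as np

def gamblers_value_iteration(p_h=0.4, goal=100, theta=1e-9):
    """Value iteration for the Gambler's Problem: V[s] estimates the
    probability of reaching the goal from capital s."""
    V = np.zeros(goal + 1)
    V[goal] = 1.0                     # winning terminal state
    while True:
        delta = 0.0
        for s in range(1, goal):
            stakes = range(1, min(s, goal - s) + 1)
            values = [p_h * V[s + a] + (1 - p_h) * V[s - a] for a in stakes]
            best = max(values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    return V
```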
Using Blackjack example from the book.
* [TIPS]: The MCES algorithm will lead to biased value estimates.
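A generic first-visit Monte Carlo Exploring Starts control loop, sketched against a hypothetical episodic `env` interface (its `reset_random`/`step` methods and all names here are assumptions, not the repo's Blackjack code):

```python
import numpy as np
from collections import defaultdict

def mc_exploring_starts(env, n_episodes, gamma=1.0):
    """Generic first-visit Monte Carlo Exploring Starts control.
    Assumes env.reset_random() returns a random (state, action) pair and
    env.step(a) returns (next_state, reward, done); states must be hashable."""
    n_actions = len(env.actions)
    Q = defaultdict(lambda: np.zeros(n_actions))
    N = defaultdict(lambda: np.zeros(n_actions))
    policy = {}
    for _ in range(n_episodes):
        s, a = env.reset_random()               # exploring start
        episode, done = [], False
        while not done:
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            if not done:
                s = s_next
                a = policy.get(s, np.random.randint(n_actions))
        # first-visit returns, processed backwards
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:        # update only on the first visit
                N[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / N[s][a]
                policy[s] = int(np.argmax(Q[s]))
    return Q, policy
```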
Using Racetrack example from the book.
Several Problems Confronted...
- I tried to use $\epsilon=0.5$ for the behavior policy, but this caused training to concentrate only on the last few states of each episode (because the condition $\text{Action}_{\text{target}}=\text{Action}_{\text{behavior}}$ could hardly be satisfied; see the sketch below). So I set $\epsilon=0.1$.
- I tried to use Every-Visit MC, but this resulted in quite unbalanced learning: within the episodes there are far more visits to the $(s,a)$ pairs that have already been learned by the agent (since the target policy is greedy and the behavior policy is $\epsilon$-greedy), and these data dominated the updates.
- Or, mathematically, the Every-Visit method is biased.
- Consequently, after about 1000 episodes, the agent almost always chose the same trajectories.
- Finally, I used First-Visit MC, which solved the above problem.
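For reference, a sketch of the backward pass of off-policy MC control with weighted importance sampling (greedy target, $\epsilon$-greedy behavior), which is where the $\text{Action}_{\text{target}}=\text{Action}_{\text{behavior}}$ condition comes from. The function and the defaultdict-style `Q`/`C` containers are my assumptions, and the first-visit filtering described above is omitted:

```python
import numpy as np

def off_policy_mc_update(episode, Q, C, target, epsilon, n_actions, gamma=1.0):
    """Backward pass of off-policy MC control with weighted importance
    sampling. episode is a list of (state, action, reward); Q and C are
    defaultdicts mapping state -> np.zeros(n_actions); target maps
    state -> greedy action."""
    G, W = 0.0, 1.0
    for s, a, r in reversed(episode):
        G = gamma * G + r
        C[s][a] += W
        Q[s][a] += (W / C[s][a]) * (G - Q[s][a])
        target[s] = int(np.argmax(Q[s]))
        if a != target[s]:
            break                     # Action_behavior != Action_target: stop
        # behavior is epsilon-greedy w.r.t. Q, so for the greedy action:
        W *= 1.0 / (1.0 - epsilon + epsilon / n_actions)
    return Q, C, target
```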
Using Cliff Walking example from the book.
- Why does Q-Learning tend to stay near the cliff while SARSA stays away from it?
- Q-Learning learns the optimal policy without considering the current behavior policy, which is $\epsilon$-greedy.
- But during training the agent takes $\epsilon$-greedy actions, which may lead to falling off the cliff. Thus, the training curve of Q-Learning is lower than that of SARSA (the two update targets are contrasted in the sketch below).
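The difference comes down to the backup target; a minimal tabular sketch (names are mine):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """Off-policy target: bootstrap from the greedy action in s_next."""
    bootstrap = 0.0 if terminal else gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (r + bootstrap - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """On-policy target: bootstrap from the epsilon-greedy action actually
    taken in s_next, so exploration risk (the cliff) is reflected in Q."""
    bootstrap = 0.0 if terminal else gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (r + bootstrap - Q[s][a])
```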
Using Windy Gridworld example from the book.
- DEBUG INFO: When using n-step methods, never forget to fully train the agent on the steps near the end of each episode (i.e., keep performing updates after the terminal state is reached, until $\tau = T - 1$).
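What "fully train on the steps near the end" means in practice: in n-step Sarsa the update loop keeps running after the terminal time $T$, until $\tau = T - 1$. A sketch under assumed `env`/`policy` interfaces (all names mine):

```python
def n_step_sarsa_episode(env, Q, policy, n, alpha, gamma):
    """One episode of n-step Sarsa. The point relevant to the debug note:
    the loop keeps running after the terminal time T so that tau reaches
    every step up to T - 1, i.e. the last states also get their updates.
    Assumes env.reset() -> state, env.step(a) -> (next_state, reward, done),
    policy(Q, state) -> action index, and Q is a dict state -> array."""
    S = [env.reset()]
    A = [policy(Q, S[0])]
    R = [0.0]                                  # R[0] is a dummy placeholder
    T, t = float('inf'), 0
    while True:
        if t < T:                              # still interacting with the env
            s_next, r, done = env.step(A[t])
            S.append(s_next)
            R.append(r)
            if done:
                T = t + 1
            else:
                A.append(policy(Q, s_next))
        tau = t - n + 1                        # time whose value is updated now
        if tau >= 0:
            G = sum(gamma ** (i - tau - 1) * R[i]
                    for i in range(tau + 1, min(tau + n, T) + 1))
            if tau + n < T:
                G += gamma ** n * Q[S[tau + n]][A[tau + n]]
            Q[S[tau]][A[tau]] += alpha * (G - Q[S[tau]][A[tau]])
        if tau == T - 1:                       # stop only after the tail updates
            break
        t += 1
    return Q
```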
Use 3 different Maze examples from the book.
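Assuming these are the Dyna-Q / Dyna-Q+ maze experiments from the planning chapter, here is a minimal tabular Dyna-Q step (deterministic model; the names and dict-based containers are my assumptions):

```python
import random
import numpy as np

def dyna_q_step(Q, model, s, a, r, s_next, alpha, gamma, n_planning):
    """One direct-RL update followed by n_planning simulated updates from a
    learned model. Q is a dict state -> np.array of action values;
    model maps (state, action) -> (reward, next_state)."""
    # (a) direct reinforcement learning
    Q[s][a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s][a])
    # (b) model learning (assumes a deterministic environment)
    model[(s, a)] = (r, s_next)
    # (c) planning: replay randomly chosen previously observed transitions
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[ps][pa] += alpha * (pr + gamma * np.max(Q[ps_next]) - Q[ps][pa])
    return Q, model
```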
Use Racetrack example from the book.
Use Racetrack example from the book.
[TIPS] The policy evaluated here is the trained RaceTrack_Policy.py.
Use Access Control Queuing example from the book.
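Assuming this follows the book's differential semi-gradient Sarsa for the average-reward setting, a minimal tabular sketch of one update (names mine):

```python
def differential_sarsa_update(Q, avg_reward, s, a, r, s_next, a_next, alpha, beta):
    """One step of tabular differential Sarsa (average-reward setting): the
    TD error uses r - avg_reward instead of discounting. Q is updated in
    place; the new average-reward estimate is returned."""
    delta = r - avg_reward + Q[s_next][a_next] - Q[s][a]
    avg_reward += beta * delta        # update the average-reward estimate
    Q[s][a] += alpha * delta
    return avg_reward
```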