Some environments have a large number of possible actions. For example, consider playing Go on a standard $19 \times 19$ board, where up to 361 points are available for a move at any turn. A simple approach would be to output a value for each of the possible actions, but this scales poorly when the action space is very large or combinatorial.

Let's denote the set of all states as $\mathcal{S}$ and the set of all actions as $\mathcal{A}$. We call the representation associated with a state-action pair its action-state, and the mapping from a pair $(s, a)$ to its action-state the action-state map. With this, we can define functions such as the policy logit and the state-action value over action-states rather than over a fixed, exhaustive action index.
Using the action-state map offers several advantages:
- It provides useful information about actions when we have knowledge of the environment. For example, Go is a deterministic environment, so if we define the action-state of $(s, a)$ as the board position that results from playing $a$ in $s$, the network directly sees the consequence of each move.
- Handling various action shapes and illegal actions becomes easier.
Consider the BigTwo game environment mentioned earlier. There are thousands of possible hand combinations, so computing logits and state-action values for every combination in every situation is highly inefficient, and handling the many illegal combinations is awkward. With the action-state map, the agent computes logits and state-action values only for the actions it can actually take. Thus, regardless of the action shape or the existence of illegal actions, restricting attention to feasible action-states lets the agent handle the gap between feasible and total actions easily (see the sketch after this list).
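As a concrete illustration, here is a minimal sketch in Python/PyTorch (the original does not specify an implementation) of scoring only the feasible actions through their action-states. The class name `ActionStateScorer`, the layer sizes, and the feature dimension are illustrative assumptions; the point is that one shared network maps each feasible action-state to a logit and a state-action value, so the number and shape of legal actions can vary freely.

```python
# Sketch: score only feasible actions via their action-states, instead of
# producing a fixed-size output over every possible action.
import torch
import torch.nn as nn


class ActionStateScorer(nn.Module):
    """Shared network mapping one action-state vector to (logit, Q-value)."""

    def __init__(self, feature_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(feature_dim, hidden_dim), nn.ReLU())
        self.logit_head = nn.Linear(hidden_dim, 1)  # policy logit per feasible action
        self.q_head = nn.Linear(hidden_dim, 1)      # state-action value per feasible action

    def forward(self, action_states: torch.Tensor):
        # action_states: (num_feasible_actions, feature_dim) -- feasible actions only.
        h = self.body(action_states)
        logits = self.logit_head(h).squeeze(-1)
        q_values = self.q_head(h).squeeze(-1)
        probs = torch.softmax(logits, dim=-1)       # policy over the feasible actions
        return probs, q_values


# Usage: a turn with 5 feasible moves needs no masking of illegal actions.
scorer = ActionStateScorer(feature_dim=8)
probs, q_values = scorer(torch.randn(5, 8))
```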
However, when we try to train a policy using the action-state map, we face the challenge of having to recompute logits for all feasible action-states just to obtain the normalized action probabilities. Let's denote the logit when taking action $a$ in state $s$ by $l(s, a)$.

The issue we face now is having to compute logits for actions not taken. To overcome this, we'll store the value $l(s, a)$ of the action actually taken at the time it is collected. Now, let's denote the policies and logits corresponding to two parameter vectors $\theta$ and $\theta_{\text{old}}$ by $\pi_\theta, l_\theta$ and $\pi_{\theta_{\text{old}}}, l_{\theta_{\text{old}}}$, respectively.
Then, the following holds:
Theorem 3.1
Proof) Since
Therefore,
The crucial point here is that
Algorithm 3.1
Set
Set
Set
Set
Set
Set
Set
For
ㅤㅤcalculate
Set
Set
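As a rough sketch of the shortcut this algorithm aims at, assume that $\rho$ denotes the probability ratio $\pi_\theta(a \mid s) / \pi_{\theta_{\text{old}}}(a \mid s)$ and that the softmax normalizer over all feasible action-states changes little between the old and new parameters; both are assumptions for illustration, not the exact statement of Algorithm 3.1. Under those assumptions, the ratio for the taken action can be estimated from its stored logit alone, without recomputing logits for the actions not taken.

```python
# Illustrative only: approximate the probability ratio rho from the stored
# behaviour logit of the taken action, assuming the softmax normalizer over
# all feasible action-states roughly cancels between old and new parameters.
import math


def approx_prob_ratio(new_logit: float, stored_old_logit: float) -> float:
    """rho ~= pi_new(a|s) / pi_old(a|s) when the normalizers approximately cancel."""
    return math.exp(new_logit - stored_old_logit)


# Example: the taken action's logit rose from 1.2 to 1.5 after an update.
print(approx_prob_ratio(1.5, 1.2))  # exp(0.3) ~= 1.35
```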
In cases where we do not use an approximation of $\rho$, the action probabilities are computed exactly over all feasible action-states. We can simply employ the 'max advantage action incentive', defined as:
where
There are some notable points:
- If $\pi$ is optimal, the max advantage action incentive is always 0.
- If the max advantage action incentive increases, $\pi(s, \text{argmax}_a Q(s, a))$ tends to increase.
- Why do we estimate state-action values instead of state values? To calculate state values, only the state is required, whereas action-states depend on both the state and the action, so the simplest structure is to estimate state-action values rather than state values. Moreover, since this method still yields the policy probabilities, state values can be recovered from them. It is also possible to compute state values using common states; for details, refer to Appendix B: State-value based action-state algorithm.
- Is it necessary to approximate the policy using Algorithm 3.1? The algorithm is a way to compute quickly in environments with many actions, but it is not always preferable. In particular, when the action-states are at similar distances from one another, $\rho$ tends to change accordingly. If accurate probability calculation matters more than computational efficiency, there is no need to use it.
- Is the max advantage action incentive necessary? Experiments showed that agents following the learned policy perform better than agents acting greedily with respect to the estimated state-action values; in other words, those values can be overestimated. By providing the max advantage action incentive, the agent explores the seemingly overestimated actions further and can determine more accurately whether they truly have such values (a numerical sketch follows this list).
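As a numerical sketch, assume the max advantage action incentive takes the form $\max_a Q(s, a) - \sum_a \pi(a \mid s)\, Q(s, a)$; this particular form is an assumption chosen to be consistent with the two properties listed above (it is zero for a policy that is greedy with respect to $Q$, and pushing it up pushes probability mass toward the $Q$-maximizing action), not necessarily the exact definition used here. The function and example values below are purely illustrative.

```python
# Sketch under the assumed form: gap between the best estimated action value
# and the expected value of the current policy.
import numpy as np


def max_advantage_action_incentive(probs: np.ndarray, q_values: np.ndarray) -> float:
    """max_a Q(s, a) - sum_a pi(a|s) Q(s, a) over the feasible actions."""
    expected_value = float(np.dot(probs, q_values))  # value of the current policy
    return float(np.max(q_values)) - expected_value


# Example with three feasible actions.
probs = np.array([0.2, 0.5, 0.3])
q_values = np.array([1.0, 0.5, 2.0])
print(max_advantage_action_incentive(probs, q_values))  # 2.0 - 1.05 = 0.95
```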
I have been using this algorithm in BigTwo and achieved superhuman performance despite a relatively small number of parameters (though without using the approximation of $\rho$). For further details, please refer to BigTwo.
Set
Set
Set
Set
Set
Set
Randomly initialize the actor and critic parameters
Initialize the old actor network
For
Set
Let
While neither truncated nor terminated do
Step using policy
Stack
For
Sample training data from
Optimize
Calculate probability ratio
Optimize
Clear
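The listing above follows a PPO-style actor-critic loop (note the old actor network and the probability ratio). The skeleton below mirrors that structure in Python; the `env`, `actor`, and `critic` interfaces, the batch size, and the clipping constant are placeholders rather than the original implementation, and the old actor's bookkeeping is folded into the stored behaviour logits, as discussed earlier.

```python
# Structural sketch of the training loop; env/actor/critic are hypothetical
# objects (gymnasium-style env; actor/critic exposing act/evaluate/optimize).
import math
import random


def train(env, actor, critic, iterations=100, epochs=4, batch_size=32, clip_eps=0.2):
    buffer = []
    for _ in range(iterations):
        state, _ = env.reset()
        terminated = truncated = False
        while not (terminated or truncated):
            action, logit = actor.act(state)               # step using the current policy
            next_state, reward, terminated, truncated, _ = env.step(action)
            buffer.append((state, action, logit, reward))  # stack the transition with its logit
            state = next_state
        for _ in range(epochs):
            batch = random.sample(buffer, min(batch_size, len(buffer)))
            critic.optimize(batch)                         # fit the state-action values
            for s, a, old_logit, _ in batch:
                new_logit = actor.evaluate(s, a)
                ratio = math.exp(new_logit - old_logit)    # approximate probability ratio
                clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
                actor.optimize(s, a, ratio, clipped)       # clipped policy update
        buffer.clear()                                     # discard on-policy data
```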
Appendix B: State-value based action-state algorithm
The state-value based action-state algorithm calculates state values and action probabilities from two components, instead of predicting state-action values, as follows:
- Common state: used on its own to compute the state value, and combined with the action states to compute action probabilities.
- Action state: used together with the common state to compute action probabilities.
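A minimal sketch of this two-component layout, assuming a PyTorch module with concatenation as the fusion step (the class name, layer sizes, and dimensions are illustrative, not the appendix's exact architecture): the common-state branch alone produces the state value, and each feasible action-state is combined with the encoded common state to produce a logit.

```python
import torch
import torch.nn as nn


class StateValueActionStateNet(nn.Module):
    """Common state -> V(s); common state + each action-state -> action logit."""

    def __init__(self, common_dim: int, action_state_dim: int, hidden: int = 64):
        super().__init__()
        self.common_encoder = nn.Sequential(nn.Linear(common_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)  # state value from the common state only
        self.logit_head = nn.Sequential(
            nn.Linear(hidden + action_state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, common_state: torch.Tensor, action_states: torch.Tensor):
        # common_state: (common_dim,); action_states: (num_feasible_actions, action_state_dim)
        h = self.common_encoder(common_state)
        value = self.value_head(h)
        h_rep = h.unsqueeze(0).expand(action_states.size(0), -1)
        logits = self.logit_head(torch.cat([h_rep, action_states], dim=-1)).squeeze(-1)
        return value, torch.softmax(logits, dim=-1)


# Usage: one common-state vector and six feasible action-states.
net = StateValueActionStateNet(common_dim=16, action_state_dim=8)
value, probs = net(torch.randn(16), torch.randn(6, 8))
```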
The motivation behind the recurrent Q-network is to train a neural network to learn the action-state map itself.