Support for MultiDiscrete and MultiBinary action spaces in PPO #30
Description
Closes #19.

Adds support for `MultiDiscrete` and `MultiBinary` action spaces to `PPO`.

Constructs a multivariate categorical distribution through TensorFlow Probability's `Independent` and `Categorical`. Note that the `Categorical` distribution requires every variable to have the same number of categories. Therefore, I pad the logits to the largest number of categories across the dimensions (padding with `-inf` to ensure that these invalid actions have zero probability). `MultiBinary` is handled as a special case of `MultiDiscrete` with two choices per categorical variable. Only one-dimensional action spaces are supported, so using, e.g., `MultiDiscrete([[2], [3]])` or `MultiBinary([2, 3])` will result in an exception (as in stable-baselines3).
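For illustration, here is a minimal sketch of this padded construction (not the PR's exact code), using TFP's JAX substrate since sbx is JAX-based; the `nvec = [3, 5]` example and the random logits are assumptions made for the example:

```python
import jax
import numpy as np
from tensorflow_probability.substrates import jax as tfp

tfd = tfp.distributions

# Illustrative MultiDiscrete([3, 5]): two categorical variables with
# 3 and 5 categories respectively.
nvec = np.array([3, 5])
max_categories = int(nvec.max())

# Flat logits as a policy head might output them, one block per dimension.
flat_logits = np.random.randn(int(nvec.sum())).astype(np.float32)

# Pad each dimension's logits to the largest number of categories with -inf,
# so the padded (invalid) actions get zero probability after the softmax.
padded_logits = np.full((len(nvec), max_categories), -np.inf, dtype=np.float32)
start = 0
for i, n in enumerate(nvec):
    padded_logits[i, :n] = flat_logits[start : start + n]
    start += n

# Categorical over each row; Independent reinterprets the leading batch
# axis as part of the event, so one sample is a full MultiDiscrete action.
dist = tfd.Independent(
    tfd.Categorical(logits=padded_logits),
    reinterpreted_batch_ndims=1,
)

action = dist.sample(seed=jax.random.PRNGKey(0))  # shape (2,), e.g. [1, 4]
log_prob = dist.log_prob(action)                  # scalar
```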
Testing
I added some tests (`tests/test_space`, similar to the tests in stable-baselines3) that check that learning runs without errors and that the correct exceptions are raised if PPO is used with multi-dimensional `MultiDiscrete` and `MultiBinary` action spaces (a sketch of such a check is shown below).

To check whether there are issues with the learning performance, I compared the performance to stable-baselines3's PPO on `MultiDiscrete` and `MultiBinary` action space environments. Since there are no environments with these action spaces in the classic Gym benchmarks, I used a discretized action version of Reacher and a binary action version of Acrobot for testing purposes (see the wrappers below).
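A sketch of what such an exception check could look like; the `DummyEnv` helper, the test name, and the broad `Exception` match are assumptions for illustration, not the PR's actual test code:

```python
import gymnasium as gym
import numpy as np
import pytest

from sbx import PPO


class DummyEnv(gym.Env):
    """Minimal env with a configurable action space (illustrative only)."""

    def __init__(self, action_space):
        self.action_space = action_space
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        return self.observation_space.sample(), 0.0, False, False, {}


@pytest.mark.parametrize(
    "action_space",
    [gym.spaces.MultiDiscrete([[2], [3]]), gym.spaces.MultiBinary([2, 3])],
)
def test_multidim_action_space_raises(action_space):
    # Multi-dimensional MultiDiscrete/MultiBinary spaces should be rejected,
    # matching stable-baselines3's behaviour (exact exception type may differ).
    with pytest.raises(Exception):
        PPO("MlpPolicy", DummyEnv(action_space)).learn(100)
```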
Test script for `MultiDiscrete` action spaces:
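The original script is not reproduced here; the following is a hedged sketch of a discretizing wrapper and training call along the lines the description suggests. The environment id (`Reacher-v4`), the bin count, and the step budget are assumptions; swapping in `stable_baselines3.PPO` for `sbx.PPO` gives the comparison run:

```python
import gymnasium as gym
import numpy as np

from sbx import PPO  # swap in stable_baselines3.PPO for the comparison run


class DiscretizeActionWrapper(gym.ActionWrapper):
    """Turn a Box action space into MultiDiscrete with `n_bins` bins per dim."""

    def __init__(self, env, n_bins=11):
        super().__init__(env)
        assert isinstance(env.action_space, gym.spaces.Box)
        self.n_bins = n_bins
        self.action_space = gym.spaces.MultiDiscrete([n_bins] * env.action_space.shape[0])

    def action(self, action):
        # Map bin indices back to evenly spaced continuous values in [low, high].
        low, high = self.env.action_space.low, self.env.action_space.high
        return (low + action / (self.n_bins - 1) * (high - low)).astype(np.float32)


env = DiscretizeActionWrapper(gym.make("Reacher-v4"))
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```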
Test script for `MultiBinary` action spaces:
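Likewise a sketch, not the original script, for a binary-action Acrobot wrapper; the two-bit action layout and the bit-to-torque mapping are assumptions:

```python
import gymnasium as gym

from sbx import PPO  # swap in stable_baselines3.PPO for the comparison run


class BinaryActionWrapper(gym.ActionWrapper):
    """Expose Acrobot's Discrete(3) actions as two binary flags."""

    def __init__(self, env):
        super().__init__(env)
        self.action_space = gym.spaces.MultiBinary(2)

    def action(self, action):
        # bit 0 -> -1 torque, bit 1 -> +1 torque; both or neither -> 0 torque.
        if action[0] and not action[1]:
            return 0  # -1 torque
        if action[1] and not action[0]:
            return 2  # +1 torque
        return 1      # 0 torque


env = BinaryActionWrapper(gym.make("Acrobot-v1"))
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```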
Results: sbx's and stable-baselines3's PPO have the same learning performance.

Motivation and Context
Types of changes
Checklist:
(The changelog seems to be in the stable-baselines3 repository, so I would need to create a separate PR for that)
(There is no separate documentation for sbx that I could update)
- `make format` (required)
- `make check-codestyle` and `make lint` (required)
- `make pytest` and `make type` both pass. (required)
- `make doc` (required)

Note: You can run most of the checks using `make commit-checks`.

Note: we are using a maximum length of 127 characters per line.