Releases: rl-tools/rl-tools
v2.2.0
To prepare RLtools for mixed-precision training we introduce a type policy, `numeric_types::Policy`. Until now it was easy to switch the floating point type in RLtools because everything depended on the single `T` parameter. For modern deep learning this is not sufficient, because we would like to configure different types for different parts of the models/algorithms (e.g. bf16 for parameters, fp32 for the gradient/optimizer state), and a single type parameter cannot express that.
Hence we created `numeric_types::Policy` to enable flexible type configuration:
```cpp
using namespace rlt::numeric_types;
using PARAMETER_TYPE_RULE = UseCase<categories::Parameter, float>;
using GRADIENT_TYPE_RULE = UseCase<categories::Gradient, float>;
using TYPE_POLICY = Policy<double, PARAMETER_TYPE_RULE, GRADIENT_TYPE_RULE>;
```
The `TYPE_POLICY` is then passed instead of `T`, e.g.:
```cpp
using MODEL_CONFIG = rlt::nn_models::mlp::Configuration<TYPE_POLICY, TI, OUTPUT_DIM, NUM_LAYERS, HIDDEN_DIM, ACTIVATION_FUNCTION, ACTIVATION_FUNCTION_OUTPUT>;
```
In the codebase the `TYPE_POLICY` is then queried as follows:
```cpp
using PARAMETER_TYPE = TYPE_POLICY::template GET<categories::Parameter>;
using GRADIENT_TYPE = TYPE_POLICY::template GET<categories::Gradient>;
using OPTIMIZER_TYPE = TYPE_POLICY::template GET<categories::Optimizer>;
```
This allows for very flexible configuration. If a category is not set (like `categories::Optimizer` in this case), it falls back to `TYPE_POLICY::DEFAULT`, which here is `double` (the first argument). `TYPE_POLICY::DEFAULT` is also the type that should be used for configuration variables and other variables that do not clearly fall under one of the categories. You can also easily define custom category tags yourself. More about this will be covered in a section of the documentation at https://docs.rl.tools in the future.
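For illustration, defining a custom category tag could look like the following. This is a minimal sketch, assuming a user-defined tag type can be used the same way as the built-in categories (the `ActivationCategory` name is hypothetical; see the upcoming documentation for the exact requirements):
```cpp
// hypothetical custom category tag, e.g. for activation buffers
struct ActivationCategory{};
using ACTIVATION_TYPE_RULE = rlt::numeric_types::UseCase<ActivationCategory, float>;
using MY_TYPE_POLICY = rlt::numeric_types::Policy<double, ACTIVATION_TYPE_RULE>;

using ACTIVATION_TYPE = MY_TYPE_POLICY::template GET<ActivationCategory>;                       // float: explicitly configured
using OPTIMIZER_TYPE = MY_TYPE_POLICY::template GET<rlt::numeric_types::categories::Optimizer>; // double: falls back to DEFAULT
```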
This is a small API change, but it appears in many places, so we implemented it ASAP (without mixed-precision training itself being implemented yet), such that there will be less confusion in the future, when we expect these kinds of APIs to be more stable.
Currently, the advice is to just create a `TYPE_POLICY = rlt::numeric_types::Policy<{float,double}>;` and pass it everywhere. You might encounter errors when trying to access e.g. some `SPEC::T`, which you should be able to replace with `SPEC::TYPE_POLICY::DEFAULT` for identical behavior. In general the behavior should be exactly identical as long as you configure the same float type you used for `T` before.
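For existing code, the migration should be mostly mechanical. A minimal sketch, reusing the configuration names from the example above (whether `Policy` can be instantiated without any `UseCase` rules, as shown here, is an assumption based on the advice above):
```cpp
// before (pre-v2.2.0): a single float type T was passed everywhere
// using T = float;
// using MODEL_CONFIG = rlt::nn_models::mlp::Configuration<T, TI, OUTPUT_DIM, NUM_LAYERS, HIDDEN_DIM, ACTIVATION_FUNCTION, ACTIVATION_FUNCTION_OUTPUT>;

// after (v2.2.0): wrap the same float type in a type policy and pass that instead
using TYPE_POLICY = rlt::numeric_types::Policy<float>;
using MODEL_CONFIG = rlt::nn_models::mlp::Configuration<TYPE_POLICY, TI, OUTPUT_DIM, NUM_LAYERS, HIDDEN_DIM, ACTIVATION_FUNCTION, ACTIVATION_FUNCTION_OUTPUT>;

// code that previously read SPEC::T can use SPEC::TYPE_POLICY::DEFAULT (== float here)
```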
v2.1.0
- Cleaning up repo structure: Before, we used `rl-tools/rl-tools` as a monorepo to version everything in the RLtools universe together. This is not great if someone just wants the header-only library (aka just the `./include`): one `git clone --recursive` or `git submodule update --init --recursive` could trigger gigabytes of downloads. Also, jumping around the history is cumbersome with all the submodules (e.g. for bisecting etc.). Hence, we moved the versioning of adjacent projects to `rl-tools/mono`, and `rl-tools/rl-tools` is now the submodule-free, lightweight core (~7 MB download for the full history).
- Memory: The main work between `v2.0` and `v2.1` has been on maturing the memory implementation (aka using RNNs in off-policy algorithms). For more information see the RNN and Memory chapters in the documentation.
- Flag Environment: We introduce a basic environment to test the recurrent RL algorithms, where the positions of two flags are revealed in the initial step and the policy has to memorize them to visit the positions in order. The second position is required because, with only one position, the agent could cheat by just accelerating in the right direction and hence storing the direction in the state instead of memorizing it internally. You can see the Flag environment and a baseline policy at https://zoo.rl.tools
- Adding inference utils: We have added some common inference utils in `include/rl_tools/inference` that e.g. can expose a pure C interface (e.g. for microcontroller integrations) and more.
- Full CUDA training: We have revived the full on-GPU training; it is tracked here. It supports full CUDA graph capture, which means 1 loop step = 1 graph execution.
- L2F: The L2F simulator has been modularized and structured better.
v2.0.0
- ExTrack experiment tracking conventions
- studio.rl.tools and UI conventions (HTML5 Canvas)
- Zoo: Tuned environment + algorithm examples (zoo.rl.tools)
- MAPPO: Multi-Agent PPO
- L2F: Upstreaming the Learning to Fly in Seconds simulator as a first class citizen in RLtools (batteries included but swappable)
- Improved Environment Interface
- and more...
v1.1.0
Making ESP32 operations compatible with the new structure
v1.0.0
Update Linux release instructions