Supports only Python3 (oops).
Most (deep) RL algorithms work by optimizing a neural network through interacting with a learning environment. The goal of this package is to minimize the implementation effort of RL practitioners. They only need to implement (or, more commonly, wrap) an OpenAI-gym environment and a neural network they want to use as a tf.keras model (along with an interface function that turns the observation from the gym environment into a format that can be fed into the tf.keras model), in order to run RL algorithms.
Run pip install -e . in your favorite virtual environment. Requirements:
- tensorflow>=1.5.0
- gym>=0.9.6
- gym[atari] (optional)
Actor-critic family
- A3C (https://arxiv.org/abs/1602.01783) Actor-critic, using a critic-based advantage function as the baseline for variance reduction, asynchronous parallel.
- ACER (https://arxiv.org/abs/1611.01224) A3C with uniform replay, using the Retrace off-policy correction. The critic becomes a state-action value function instead of a state-only function. The authors proposed a trust-region optimization scheme based on the KL divergence with respect to a Polyak-averaged policy network. This implementation instead includes the KL divergence (with a tunable scale factor) in the total loss. This choice is less robust to changes in hyperparameters, but simplifies the combination of ACER and ACKTR.
- IMPALA (https://arxiv.org/abs/1802.01561) A3C with replay and another (actually, simpler) flavor of off-policy correction called V-trace. This implementation is much more naive than the original distributed framework, but it gives an idea of how the off-policy correction is done and is much easier to integrate with ACKTR.
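To make the off-policy correction in the IMPALA entry more concrete, here is a minimal NumPy sketch of V-trace value targets for a single rollout, following the recursion given in the IMPALA paper. It is not this package's implementation: the function name vtrace_targets and its arguments are hypothetical, episode termination inside the rollout is ignored, and the two thresholds only mirror the roles of impala_trunc_rho_max and impala_trunc_c_max.

```python
import numpy as np


def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   discount=0.99, rho_max=1.0, c_max=1.0):
    """V-trace targets for one rollout (hypothetical helper, not drlbox API).

    rewards, values, ratios: length-T sequences; ratios[t] is
    pi(a_t | x_t) / mu(a_t | x_t) for target policy pi and behavior policy mu.
    bootstrap_value: value estimate of the state following the rollout.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    ratios = np.asarray(ratios, dtype=np.float64)

    rhos = np.minimum(rho_max, ratios)   # truncated weights for the TD errors
    cs = np.minimum(c_max, ratios)       # truncated weights for the trace
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + discount * next_values - values)

    # Backward recursion: v_t - V(x_t) = delta_t + discount * c_t * (v_{t+1} - V(x_{t+1}))
    corrections = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + discount * cs[t] * acc
        corrections[t] = acc
    return values + corrections
```

When all ratios are 1 (on-policy data) and both thresholds are 1, the targets reduce to ordinary n-step bootstrapped returns, which is a convenient sanity check.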
DQN family
- DQN (https://arxiv.org/abs/1602.01783) Asynchronous multi-step Q-learning.
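As an illustration of multi-step Q-learning, the sketch below computes n-step targets for one rollout by bootstrapping from the Q-values of the state that follows the rollout, with an optional double-DQN style action selection. The function name and arguments are hypothetical and chosen for this example only; the package's actual target computation may differ, and episode termination inside the rollout is ignored here.

```python
import numpy as np


def nstep_q_targets(rewards, q_online_last, q_target_last,
                    discount=0.99, double=True):
    """n-step Q-learning targets for one rollout (hypothetical helper).

    rewards: length-T sequence of rewards collected in the rollout.
    q_online_last / q_target_last: Q-values of the state after the rollout,
    as estimated by the online and target networks respectively.
    """
    q_online_last = np.asarray(q_online_last, dtype=np.float64)
    q_target_last = np.asarray(q_target_last, dtype=np.float64)

    if double:
        # Double DQN: the online net picks the action, the target net scores it.
        bootstrap = q_target_last[np.argmax(q_online_last)]
    else:
        # Regular DQN: the target net both picks and scores the action.
        bootstrap = q_target_last.max()

    targets = np.zeros(len(rewards))
    acc = bootstrap
    for t in reversed(range(len(rewards))):
        acc = rewards[t] + discount * acc
        targets[t] = acc
    return targets
```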
Algorithm related options
- noisynet='ig' or noisynet='fg': Based on the idea of NoisyNet (https://arxiv.org/abs/1706.10295), which adds independent ('ig') or factorized ('fg') Gaussian noise to network weights. Allows the exploration strategy to change across different training stages and adapt to different parts of the state representation.
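The difference between the two noise types can be sketched in NumPy as follows. This is an illustration of the idea in the NoisyNet paper, not the layer used by this package: the helper names are hypothetical, and the learnable parameters (the mu and sigma arrays) are passed in as plain arrays.

```python
import numpy as np


def scale_noise(x):
    # f(x) = sgn(x) * sqrt(|x|), the scaling used for factorized noise
    return np.sign(x) * np.sqrt(np.abs(x))


def sample_noise(n_in, n_out, mode, rng):
    """Return (weight_noise, bias_noise) for a dense layer (hypothetical helper)."""
    if mode == 'ig':
        # Independent: one Gaussian sample per weight and per bias.
        return rng.standard_normal((n_in, n_out)), rng.standard_normal(n_out)
    if mode == 'fg':
        # Factorized: only n_in + n_out samples, combined by an outer product.
        eps_in = scale_noise(rng.standard_normal(n_in))
        eps_out = scale_noise(rng.standard_normal(n_out))
        return np.outer(eps_in, eps_out), eps_out
    raise ValueError("mode must be 'ig' or 'fg'")


def noisy_dense(x, w_mu, w_sigma, b_mu, b_sigma, mode, rng):
    """Dense layer whose weights are perturbed as mu + sigma * noise."""
    n_in, n_out = w_mu.shape
    eps_w, eps_b = sample_noise(n_in, n_out, mode, rng)
    return x @ (w_mu + w_sigma * eps_w) + (b_mu + b_sigma * eps_b)
```

Factorized noise needs far fewer random samples per forward pass, at the cost of correlated perturbations within a layer.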
Arguments shared by trainer and evaluator classes
- env_maker: callable. Returns a gym env when called. Detailed in the Gym Environment section below. Default: None.
- state_to_input: callable. Converts the observation from a gym env to some data (usually a NumPy array) that can be fed into a tf.keras model. Detailed in the Neural network section below. Default: None (will set self.state_to_input = lambda x: x internally if set to None).
- load_model: str. File name (full path) of an h5py file that contains a saved tf.keras model (usually saved through tf.keras.models.Model.save). If specified, training or evaluation will start from this model. Default: None.
- load_model_custom: dict. Same as the custom_objects argument in tf.keras.models.load_model. Default: None.
- verbose: bool. Whether or not to print training/evaluation information. Default: False.
trainer classes common arguments
- feature_maker: callable. Takes in env.observation_space and returns (inp_state, feature), a 2-tuple of a tf.keras.layers.Input layer and a tf.keras layer of arbitrary type (e.g., tf.keras.layers.Dense). Detailed in the Neural network section below. Default: None.
- model_maker: callable. Takes in a gym env and returns a tf.keras model. Detailed in the Neural network section below. The trainer will ignore feature_maker if model_maker is set. Default: None.
- num_parallel: int. Number of parallel processes in training. Default: number of logical CPU cores.
- port_begin: int. Starting gRPC port number used by distributed TensorFlow. Default: 2220.
- discount: float. Discount factor (gamma) in reinforcement learning. Default: 0.99.
- train_steps: int. Maximum number of gym env steps in training. Default: 1000000.
- rollout_maxlen: int. Maximum length of a rollout. Also the number of env steps in a rollout list. Please refer to the comments in drlbox/trainer/trainer_base.py for a detailed explanation. Default: 32.
- batch_size: int. Number of rollout lists in a batch. Please refer to the comments in drlbox/trainer/trainer_base.py for details. Default: 1.
- online_learning: bool. Whether or not to perform online learning on a newly collected batch. Default: True.
- replay_type: None or str. Type of the replay memory. Choices are [None, 'uniform'], where None means no replay memory. Default: None (note: some algorithms such as ACER and IMPALA will set replay_type='uniform' by default).
- replay_ratio: int. After a newly collected online batch is put into the replay memory, a random number of offline, off-policy batch updates is performed. That number is drawn from a Poisson distribution with this argument as the Poisson parameter. Default: 4.
- replay_kwargs: dict. Keyword arguments that will be passed to the replay constructor after being combined with the default replay keyword arguments dict(maxlen=1000, minlen=100). Default: {}.
- optimizer: str or a tf.train.Optimizer instance. str choices are ['adam']. Default: 'adam'.
- opt_clip_norm: float. Maximum global gradient norm for gradient clipping. Default: 40.0.
- opt_kwargs: dict. Keyword arguments that will be passed to the optimizer constructor after being combined with the default keyword arguments. For 'adam', the default keyword arguments are dict(learning_rate=1e-4, epsilon=1e-4). Default: {}.
- noisynet: None or str. Whether or not to enable NoisyNet when building the neural net. Detailed in the Algorithm related options section above. str choices are ['fg', 'ig'], corresponding to factorized and independent Gaussian noise, respectively. Default: None.
- save_dir: str. Path to save intermediate tf.keras models during training. No model is saved if set to None. Default: None.
- save_interval: int. Number of (global) env steps between saving tf.keras models during training. Default: 10000.
- catch_signal: bool. Whether or not to catch sigint and sigterm during multiprocess training. Useful for cleaning up dangling processes when running in the background, but may prevent other parts of the program from responding to signals. Default: False.
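The behavior described for replay_ratio and replay_kwargs can be sketched as follows. This is a hedged illustration rather than the package's code: both helper functions are hypothetical, and the assumption that user-supplied keyword arguments override the defaults is mine.

```python
import numpy as np

# Default replay keyword arguments as listed in the replay_kwargs description.
DEFAULT_REPLAY_KWARGS = dict(maxlen=1000, minlen=100)


def merge_kwargs(defaults, overrides):
    """Combine default and user keyword arguments (assumed: user values win)."""
    merged = dict(defaults)
    merged.update(overrides or {})
    return merged


def num_replay_updates(replay_ratio, rng):
    """Draw how many off-policy replay updates follow one online batch."""
    return int(rng.poisson(replay_ratio))


rng = np.random.default_rng(0)
replay_kwargs = merge_kwargs(DEFAULT_REPLAY_KWARGS, dict(maxlen=5000))
draws = [num_replay_updates(4, rng) for _ in range(5)]  # non-negative integers
```

With replay_ratio=4, each online batch is followed by four replay updates on average, but any single draw can be zero.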
algorithm='a3c' introduces the following additional arguments
- a3c_entropy_weight: float. Weight of the entropy term in the A3C loss. When noisynet is not None, it is recommended to set this argument to 0.0. Default: 1e-2.
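For readers unfamiliar with the entropy term, the sketch below shows one common way to assemble an A3C-style loss in NumPy. It is an assumption-laden illustration, not the loss used by this package: the function names are hypothetical, the value loss weight of 0.5 is a conventional choice, and the advantage is simply the return minus the value estimate.

```python
import numpy as np


def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def a3c_loss(logits, values, actions, returns, entropy_weight=1e-2):
    """Return (total_loss, mean_entropy) for a batch (hypothetical helper)."""
    logits = np.asarray(logits, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    returns = np.asarray(returns, dtype=np.float64)
    actions = np.asarray(actions, dtype=np.int64)

    log_probs = log_softmax(logits)
    probs = np.exp(log_probs)
    # In a real implementation the advantage is treated as a constant
    # (no gradient) inside the policy term.
    advantages = returns - values
    chosen_log_probs = log_probs[np.arange(len(actions)), actions]

    policy_loss = -np.mean(chosen_log_probs * advantages)
    value_loss = 0.5 * np.mean((returns - values) ** 2)
    entropy = -np.mean(np.sum(probs * log_probs, axis=-1))
    # Subtracting the weighted entropy rewards more exploratory policies.
    total = policy_loss + value_loss - entropy_weight * entropy
    return total, entropy
```

Setting entropy_weight to 0.0 removes the bonus entirely, which matches the recommendation above for runs where NoisyNet already drives exploration.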
algorithm='acer' introduces the following additional arguments
- acer_kl_weight: float. Weight of the KL divergence term wrt the average net in the ACER loss. Default: 1e-1.
- acer_trunc_max: float. Truncating threshold in ACER's modified Retrace off-policy correction. Default: 10.0.
- acer_soft_update_ratio: float. Soft update ratio to the average net. At each online network weight update, the weights in the average net will be a convex combination of the old average net weights and the new online network weights, and the coefficient of the new online network weights is this argument. Default: 0.05.
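The convex combination described for acer_soft_update_ratio amounts to the following update, shown here as a small sketch on plain numbers or NumPy arrays. The function name soft_update is hypothetical; the package applies the same idea to network weights internally.

```python
def soft_update(average_weights, online_weights, ratio=0.05):
    """Move each average-net weight a fraction `ratio` toward the online net."""
    return [(1.0 - ratio) * avg + ratio * new
            for avg, new in zip(average_weights, online_weights)]


# Example: after one update, the average moves 5% of the way to the online value.
updated = soft_update([0.0, 2.0], [1.0, 2.0], ratio=0.05)
```

A small ratio makes the average net change slowly, so the KL term penalizes large jumps of the online policy away from its recent history.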
algorithm='impala' introduces the following additional arguments
- impala_trunc_rho_max: float. Truncating threshold rho in IMPALA's V-trace off-policy correction. Default: 1.0.
- impala_trunc_c_max: float. Truncating threshold c in IMPALA's V-trace off-policy correction. Default: 1.0.
algorithm='dqn' introduces the following additional arguments
- dqn_double: bool. Whether to perform double DQN update or regular DQN update. Default: True.
- dqn_dueling: bool. Whether to set up the DQN network as a dueling network (https://arxiv.org/abs/1511.06581). Default: False.
- policy_eps_start: float. Starting epsilon in the linearly decayed epsilon greedy policy. Default: 1.0.
- policy_eps_end: float. Ending epsilon in the linearly decayed epsilon greedy policy. Default: 0.01.
- policy_eps_decay_steps: int. Number of (per-process) env steps before the linearly decayed epsilon reaches its minimum. Default: 1000000.
- sync_target_interval: int. Number of online updates between two synchronizations of the target network. Default: 1000.
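The three policy_eps_* arguments describe a linear schedule, which can be sketched as below. The function name linear_epsilon is hypothetical, and the defaults simply copy the documented default values.

```python
def linear_epsilon(step, eps_start=1.0, eps_end=0.01, decay_steps=1000000):
    """Epsilon after `step` env steps under a linear decay (hypothetical helper)."""
    # Fraction of the decay completed, clamped to [0, 1].
    fraction = min(max(step, 0) / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


# Epsilon starts at eps_start, decays linearly, then stays at eps_end.
schedule = [linear_epsilon(s) for s in (0, 500000, 1000000, 2000000)]
```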
evaluator arguments
- render_timestep: None or float. Timestep between two env.render() calls. None means no rendering. Default: None.
- render_end: bool. If set to True, will do one env.render() call after each episode ends. Default: False.
- num_episodes: int. Number of evaluation episodes. Default: 20.
- policy_type: str. Type of evaluation policy. Choices are ['stochastic', 'greedy']. Default: 'stochastic'.
- policy_eps: float. Epsilon in the epsilon greedy policy when policy_type='greedy'. Default: 0.0.
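The two policy_type choices can be pictured with the sketch below. It assumes that 'stochastic' means sampling from a softmax over the network's action scores, which is my reading for actor-critic logits and is not confirmed by this document; select_action is a hypothetical helper.

```python
import numpy as np


def select_action(scores, policy_type='stochastic', policy_eps=0.0, rng=None):
    """Pick an action index from per-action scores (hypothetical helper)."""
    if rng is None:
        rng = np.random.default_rng()
    scores = np.asarray(scores, dtype=np.float64)

    if policy_type == 'stochastic':
        # Assumed behavior: sample from softmax(scores).
        shifted = np.exp(scores - scores.max())
        probs = shifted / shifted.sum()
        return int(rng.choice(len(scores), p=probs))
    if policy_type == 'greedy':
        # Epsilon greedy: random action with probability policy_eps.
        if rng.random() < policy_eps:
            return int(rng.integers(len(scores)))
        return int(np.argmax(scores))
    raise ValueError("policy_type must be 'stochastic' or 'greedy'")
```

With policy_eps=0.0 the greedy policy is fully deterministic, which is usually what one wants when comparing saved models.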
A minimal demo can be as simple as the following code snippet (in examples/cartpole_a3c.py), which uses the A3C algorithm, the CartPole-v0 environment, and a 2-layer fully-connected net with 200 and 100 hidden units.

    '''
    cartpole_a3c.py
    '''
    import gym
    from tensorflow.python.keras.layers import Input, Dense, Activation
    from drlbox.trainer import make_trainer

    '''
    Input arguments:
        observation_space: Observation space of the environment;
        num_hid_list: List of hidden unit numbers in the fully-connected net.
    '''
    def make_feature(observation_space, num_hid_list):
        inp_state = Input(shape=observation_space.shape)
        feature = inp_state
        for num_hid in num_hid_list:
            feature = Dense(num_hid)(feature)
            feature = Activation('relu')(feature)
        return inp_state, feature

    '''
    A3C, CartPole-v0
    '''
    if __name__ == '__main__':
        trainer = make_trainer(
            algorithm='a3c',
            env_maker=lambda: gym.make('CartPole-v0'),
            feature_maker=lambda obs_space: make_feature(obs_space, [200, 100]),
            num_parallel=1,
            train_steps=1000,
            verbose=True,
            )
        trainer.run()

The user is supposed to implement an env_maker callable that returns an OpenAI-gym environment. Things like history stacking, frame skipping, and reward engineering are usually handled here as well.
The above code snippet contains a trivial example:

    env_maker=lambda: gym.make('CartPole-v0')

which is a callable that returns the 'CartPole-v0' environment.
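As a less trivial illustration, the sketch below shows how frame skipping could be handled inside env_maker. To keep the example runnable without gym installed, FrameSkip is a duck-typed wrapper rather than a gym.Wrapper subclass, and CountingEnv is a toy stand-in for a real environment; both names are hypothetical. The 4-tuple return of step follows the classic gym API that this package targets.

```python
class FrameSkip:
    """Repeat each action `skip` times and sum the rewards."""

    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip
        # Forward the spaces so that code inspecting the env still works.
        self.observation_space = getattr(env, 'observation_space', None)
        self.action_space = getattr(env, 'action_space', None)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break  # do not step past the end of an episode
        return obs, total_reward, done, info


class CountingEnv:
    """Toy env: observation is the step count, reward is 1 per step."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.horizon, {}


def env_maker():
    # With gym available, this would wrap gym.make(...) instead of CountingEnv.
    return FrameSkip(CountingEnv(horizon=10), skip=4)
```

In practice, subclassing gym.Wrapper is the more usual route because it also forwards render, close, and other attributes.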
The user is supposed to implement a feature_maker callable that takes in an observation_space and returns inp_state, a tf.keras.layers.Input layer, and feature, a tf.keras layer or a tuple of 2 tf.keras layers. For example, with actor-critic algorithms, when feature is a single tf.keras layer, the actor and the critic streams share a common stack of layers. When feature is a tuple of 2 tf.keras layers, the actor and the critic are completely separate.
The cartpole_a3c.py snippet above also contains a trivial example of the feature part of a tf.keras model:

    from tensorflow.python.keras.layers import Input, Dense, Activation

    '''
    Input arguments:
        observation_space: Observation space of the environment;
        num_hid_list: List of hidden unit numbers in the fully-connected net.
    '''
    def make_feature(observation_space, num_hid_list):
        inp_state = Input(shape=observation_space.shape)
        feature = inp_state
        for num_hid in num_hid_list:
            feature = Dense(num_hid)(feature)
            feature = Activation('relu')(feature)
        return inp_state, feature

which builds a fully-connected neural network up to the last layer before the policy/value layer. To use this feature maker, simply set feature_maker=lambda obs_space: make_feature(obs_space, [200, 100]).
Alternatively, it is possible to specify a full tf.keras model by implementing a model_maker callable. model_maker should take in the full gym env and return a tf.keras model that satisfies the output requirements of the selected training algorithm. Its model.inputs should always be a 1-tuple like (inp_state,), where inp_state is a tf.keras.layers.Input layer. Its model.outputs should also be a tuple, but the content varies according to the selected algorithm. For example, with algorithm='a3c', model.outputs should be a 2-tuple of (logits, value); with algorithm='dqn', model.outputs should be a 1-tuple of (q_values,).
The following code snippet contains a trivial example of implementing a full tf.keras model for A3C or IMPALA:

    from tensorflow.python.keras.initializers import RandomNormal
    from tensorflow.python.keras.layers import Input, Dense, Activation
    from tensorflow.python.keras.models import Model

    '''
    Input arguments:
        env: Gym env;
        num_hid_list: List of hidden unit numbers in the fully-connected net.
    '''
    def make_model(env, num_hid_list):
        inp_state = Input(shape=env.observation_space.shape)
        feature = inp_state
        for num_hid in num_hid_list:
            feature = Dense(num_hid)(feature)
            feature = Activation('relu')(feature)
        # Policy head: small initial logits give a near-uniform initial policy
        logits_init = RandomNormal(stddev=1e-3)
        logits = Dense(env.action_space.n, kernel_initializer=logits_init)(feature)
        # Value head: a single scalar state value
        value = Dense(1)(feature)
        return Model(inputs=inp_state, outputs=[logits, value])

A more detailed example can be found in examples/breakout_acer.py.
The user is also supposed to implement a state_to_input callable that takes in the observation returned by the OpenAI-gym environment's reset or step function and returns something that a tf.keras model can directly take in. Usually, this function does things like NumPy stacking and reshaping. By default, state_to_input is set to None, in which case a dummy callable state_to_input = lambda x: x will be created and used internally.
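A possible state_to_input for image observations is sketched below. It assumes, purely for illustration, that the environment returns a sequence of 2-D uint8 frames (for example from a history-stacking wrapper) and that the model expects a channels-last float array; whether a batch dimension is needed at this point is not specified here, so none is added.

```python
import numpy as np


def state_to_input(observation):
    """Stack a sequence of uint8 frames into one float32 array in [0, 1]."""
    frames = [np.asarray(frame, dtype=np.float32) / 255.0
              for frame in observation]
    # Frames become the last (channel) axis: (height, width, num_frames).
    return np.stack(frames, axis=-1)


# Example with four dummy 84x84 frames.
dummy_observation = [np.full((84, 84), 255, dtype=np.uint8)] * 4
model_input = state_to_input(dummy_observation)
```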
Note: As long as feature_maker or model_maker is implemented correctly, the trainer will run. However, to use the saving/loading functionality provided by Keras in a hassle-free manner, it is recommended that feature_maker or model_maker use only combinations of existing Keras layers, plus some NumPy utilities such as np.newaxis (NumPy has to be imported as import numpy as np, as this is the default import assumed by Keras in 'keras/layers/core.py'). Using other modules, including plain TensorFlow, is discouraged, because the Keras model loading utility literally "remembers" the code that generated the Keras model and runs through that code when it loads a saved model. If other modules are really needed, import the needed functionality inside feature_maker or model_maker so that it is imported before the code is executed. However, please do not import the entire TensorFlow package in feature_maker or model_maker (from tensorflow import x is fine, but import tensorflow as tf is not), as it will cause circular imports.