No Longer Maintained until Go gets a Library Equivalent to PyTorch or TensorFlow
This repo is no longer under active development. Until a Go deep learning library is released that surpasses Gorgonia in quality and number of features (e.g. something similar to TensorFlow), this repo will not be actively maintained. But I'll be back soon!
Until then, I've moved my code to Julia. See you when the Go neural network library comes out!
GoLearn is a reinforcement learning Go module. It implements many environments
for reinforcement learning as well as agents. It also allows users to easily
run experiments through JSON configuration files without ever touching
the code.
A number of algorithms are implemented. These are separated into the
agent/linear and agent/nonlinear packages.
The agent/linear package contains agents that use only linear function
approximation and are implemented with Gonum. The agent/nonlinear
package contains agents implemented with Gorgonia for neural
network function approximation. Note that agents in the agent/nonlinear
package can use either linear or nonlinear function approximation,
depending on how the neural network is set up: the "nonlinear" in the
package name refers to the package's ability to use nonlinear function
approximation, not a restriction to it.
The following value-based algorithms are implemented:

| Agent | Package |
|---|---|
| Linear Q-learning | agent/linear/discrete/qlearning |
| Linear Expected SARSA | agent/linear/discrete/esarsa |
| Deep Q-learning | agent/nonlinear/discrete/deepq |
The following policy gradient algorithms are implemented:

| Agent | Package |
|---|---|
| Linear-Gaussian Actor-Critic | agent/linear/continuous/actorcritic |
| Vanilla Policy Gradient | agent/nonlinear/continuous/vanillapg |
| Vanilla Actor Critic | agent/nonlinear/continuous/vanillaac |
Agents must be created with a configuration struct satisfying the Config
interface:
```go
// Config represents a configuration for creating an agent
type Config interface {
	// CreateAgent creates the agent that the config describes
	CreateAgent(env environment.Environment, seed uint64) (Agent, error)

	// ValidAgent returns whether the argument Agent is valid for the
	// Config. This differs from the Type() method in that an actual
	// Agent struct is used here, whereas the Type() method returns
	// a Type (string) describing the Agent's type. For example, a
	// Config's Type may be "CategoricalVanillaPG-MLP" or "Gaussian
	// VanillaPG-TreeMLP", which are different Config types, but each of
	// which has *vanillapg.VanillaPG as a valid Agent since that
	// is the Agent the Configs describe.
	ValidAgent(Agent) bool

	// Validate returns an error describing whether or not the
	// configuration is valid.
	Validate() error

	// Type returns the type of agent which can be constructed from
	// the Config.
	Type() Type
}
```

Each Config struct uniquely determines the set of hyperparameters and
options of a specific agent. If you try to create an Agent for which
the ValidAgent() Config method would return false, the Agent
constructor will return an error.
How do we create an agent? Imagine we have agent X defined in
package xagent. Package xagent will then contain a configuration
struct; imagine it is called xConfig. If agent X has the
hyperparameters learningRate, epsilon, and exploration,
then xConfig might look something like:

```go
type xConfig struct {
	LearningRate float64
	Epsilon      float64
	Exploration  string
}
```

This configuration uniquely defines an X agent with specific hyperparameters.
There are then two ways of creating agent X. The easiest way is to
call the CreateAgent() method on the xConfig:

```go
config := xagent.xConfig{
	LearningRate: 0.1,
	Epsilon:      0.01,
	Exploration:  "Gaussian",
}

env := ... // Create some environment to train the agent in
seed := uint64(time.Now().UnixNano())

agent, err := config.CreateAgent(env, seed)
// Train the agent here
```

The other way is through the agent constructor:
```go
config := xagent.xConfig{
	LearningRate: 0.1,
	Epsilon:      0.01,
	Exploration:  "Gaussian",
}

env := ... // Create some environment to train the agent in
seed := uint64(time.Now().UnixNano())

agent, err := xagent.New(config, env, seed)
// Train the agent here
```

Once the agent has been constructed, it is ready to train. Remember,
you must use a valid Config to construct a specific Agent.
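Because Agent constructors return an error for invalid Configs, it can be worth validating a Config up front. A minimal sketch using the Config interface methods shown above (the error-handling style is illustrative):

```go
// Check the configuration before constructing the agent. Validate()
// returns an error describing any problem with the configuration.
if err := config.Validate(); err != nil {
	log.Fatalf("invalid config: %v", err)
}

// CreateAgent also returns an error if the Config cannot produce a
// valid agent for this environment and seed.
agent, err := config.CreateAgent(env, seed)
if err != nil {
	log.Fatalf("could not create agent: %v", err)
}
_ = agent // train the agent here
```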
A very common case we find in reinforcement learning is to sweep over
a bunch of different hyperparameter settings for a single agent type.
To do this, instead of creating a []Config, we can create a ConfigList.
ConfigLists efficiently represent a set of all combinations of
hyperparameters. For example, a ConfigList for agent X might be:
```go
type xConfigList struct {
	LearningRate []float64
	Epsilon      []float64
	Exploration  []string
}
```

This ConfigList stores a list of all combinations of values found in the
fields LearningRate, Epsilon, and Exploration. For example, assume
a specific instantiation of an xConfigList looks like this:

```go
config := xConfigList{
	LearningRate: []float64{0.1, 0.5},
	Epsilon:      []float64{0.01, 0.001},
	Exploration:  []string{"Gaussian"},
}
```

This list then effectively stores a slice of 4 xConfigs (2 learning rates × 2 epsilons × 1 exploration), each taking on
a different combination of LearningRates, Epsilons, and Explorations.
On a ConfigList, you can then call the ConfigAt(i int, c ConfigList)
function, defined in the agent package, which returns the Config at
(0-indexed) position i in the list. The ConfigAt() function wraps around
the end of the list: if the ConfigList holds n Configs, then every index
nk + i (for integers k ≥ 0 and 0 ≤ i < n) refers to the same Config at
index i in the ConfigList.
Important: a ConfigList's fields must have the same names as the fields
of the Config it stores a list of. Otherwise, ConfigAt() will panic.
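For example, a sweep over every setting in a ConfigList might look like the following sketch. The number of settings is the product of the field lengths; here it is computed by hand, since the exact API for querying a ConfigList's length is not shown above:

```go
list := xConfigList{
	LearningRate: []float64{0.1, 0.5},
	Epsilon:      []float64{0.01, 0.001},
	Exploration:  []string{"Gaussian"},
}
numSettings := 2 * 2 * 1 // product of the field lengths

for i := 0; i < numSettings; i++ {
	// ConfigAt returns the Config at (0-indexed) position i; indices
	// at or beyond numSettings wrap around to the same Configs.
	config := agent.ConfigAt(i, list)
	// Create and train an agent with this config...
	_ = config
}
```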
An Experiment can be JSON serialized (more on that later). In the
Experiment JSON file, the Agent configuration will be a specific
concrete type, the TypedConfigList, which provides a primitive typing
mechanism for ConfigLists. A TypedConfigList consists of a type
descriptor and a concrete ConfigList value. This might seem similar to
how interface values are constructed in Go, and that's because it is!

TypedConfigList = | agent.Type | ConfigList |

Given a JSON serialization of a TypedConfigList, we can easily
JSON unmarshal it, recovering the stored ConfigList as its underlying
concrete type. This is required because we cannot easily unmarshal a
concrete ConfigList into a ConfigList (interface) value. For example,
the following code will panic:
```go
var a ConfigList                // ConfigList is an interface
err := json.Unmarshal(data, &a) // data holds a concrete type
if a == nil {
	panic(err)
}
```

A (possibly JSON marshalled) Experiment (more on this later) must store
an agent ConfigList describing which hyperparameters of a specific agent
should be run in the experiment. A problem is that the marshalled
Experiment will not know before runtime what kind of Agent is being
used in the experiment and therefore cannot know the concrete type
of the ConfigList of the Agent. The best the Experiment can do
is to unmarshal the ConfigList into an interface value. But this will
result in the problem shown in the code block above.
The JSON marshalled TypedConfigList will perform ConfigList typing
for the Experiment. The Experiment will then simply call CreateAgent()
and run.
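The general pattern is sketched below. This is illustrative only: the struct layout, field names, and switch are assumptions about how a TypedConfigList-style wrapper could work (assuming encoding/json and fmt are imported), not the library's actual implementation:

```go
// typedList pairs a type descriptor with raw JSON that is decoded only
// once the concrete ConfigList type is known from the descriptor.
type typedList struct {
	Type string          // e.g. "X", describing the agent type
	List json.RawMessage // the concrete ConfigList, still undecoded
}

func unmarshalTyped(data []byte) (agent.ConfigList, error) {
	var t typedList
	if err := json.Unmarshal(data, &t); err != nil {
		return nil, err
	}
	switch t.Type {
	case "X": // decode into the concrete ConfigList type for agent X
		var list xConfigList
		if err := json.Unmarshal(t.List, &list); err != nil {
			return nil, err
		}
		return list, nil
	default:
		return nil, fmt.Errorf("unknown agent type %q", t.Type)
	}
}
```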
This library makes a separation between an Environment and a Task. An
Environment is simply some domain that can be acted in, but does not
return any reward for actions. An Environment has a state, and actions
can be taken in an Environment that affect the Environment's state, but
no rewards are given for any actions in an Environment. The currently
implemented environments are in the following packages:
- gridworld: Implements gridworld environments
- maze: Implements randomly generated maze environments
- classiccontrol: Implements the classic control environments: Mountain Car, Pendulum, Cartpole, and Acrobot
- box2d: Implements environments using the Go port of the Box2D physics simulator
- mujoco: Implements environments using the MuJoCo physics simulator
- gym: Provides access to OpenAI Gym's environments through GoGym: Go Bindings for OpenAI Gym
Each package also defines public constants that determine the physical
parameters of the Environment. For example, mountaincar.Gravity is
the gravity used in the Mountain Car environment.
In depth documentation for each environment is included in the source
files. You can view the documentation, which includes descriptions of
state features, bounds on state features, actions, dimensionality of
action, and more by using the go doc command or by viewing the source
files in a text editor.
Classic control environments were adapted from OpenAI Gym's implementations. All classic control environments have both discrete and continuous action variants. Box2D environments were also adapted from OpenAI Gym's implementations and also have both discrete and continuous action variants.
Although an Environment has no concept of rewards, an Environment does
have a Task, which determines the rewards given for actions in the
Environment, the starting states in an Environment, and the end conditions
of an agent-environment interaction.
The envconfig package is used to construct Environments with specific
Tasks with default parameters. For example, to create the Cartpole
environment with the Balance task with default parameters, one can
first construct a Config that describes the Environment and Task.
Once the Config has been constructed, the CreateEnv() method will
return the respective Environment:
```go
c := envconfig.NewConfig(
	envconfig.Cartpole, // Environment
	envconfig.Balance,  // Task
	true,               // Continuous actions?
	500,                // Episode cutoff
	0.99,               // Discount
	false,              // Use the OpenAI Gym (true) or the GoLearn (false) implementation
)
env, firstStep := c.CreateEnv()
// Use env in some agent-environment interaction
```

Currently, the following Environment-Task combinations are valid
configurations:
| Environment | Task |
|---|---|
| Gridworld | Goal |
| Maze | Goal |
| MountainCar | Goal |
| Cartpole | Balance, SwingUp (soon to come) |
| Pendulum | SwingUp |
| Acrobot | SwingUp, Balance (soon to come) |
| LunarLander | Land |
| Hopper | Hop |
| Reacher | Reach |
Any other combination of Environment-Task will result in an error
when calling CreateEnv().
The gym package provides access to OpenAI Gym's environments through
GoGym: Go Bindings for OpenAI Gym. Currently, the package only supports
the MuJoCo and Classic Control environment suites, since these are the
only suites GoGym currently supports. Once GoGym adds more
environments, this package will automatically work with the new
environments (given that you update GoGym).
All OpenAI Gym environments work with only their default tasks and episode cutoffs.
The timestep package manages environmental timesteps with the TimeStep
struct. Each time an Environment takes a step, it returns a new TimeStep
struct. A TimeStep contains a StepType (either First, Middle,
or Last), a reward for the previous action, a discount value, the observation
of the next state, the number of the timestep in the current episode, and the
EndType (either TerminalStateReached, Timeout, or Nil). EndTypes have
the following meanings:
- TerminalStateReached: The episode ended because some terminal state (such as a goal state) was reached. For example, reaching the goal in Mountain Car, or having the pole fall below some set angle in Cartpole Balance.
- Timeout: The episode ended because a timestep limit was reached.
- Nil: The episode ended in some unspecified or unknown way.
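As a sketch of how these pieces fit together in an agent-environment loop (the comments about TimeStep's contents follow the description above; the action-selection step is left as pseudocode):

```go
// Run one episode, relying on Step's second return value to signal
// the last timestep of the episode.
step := env.Reset()
last := false
for !last {
	action := ... // the agent selects an action from step
	step, last = env.Step(action)
	// step carries the reward for the previous action, the discount,
	// and the observation of the next state.
}
```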
The timestep package also contains a Transition struct. These are used
to model a transition of (state, action, reward, discount, next state, next action). Sometimes the next action is omitted. These structs are sent to
Agents for many different reasons, for example, to compute the TD error on a
transition in order to track the average reward.
An Environment has a Task which outlines what the agent should accomplish.
For example, the mountaincar package implements a Goal Task which,
when added to a Mountain Car environment, rewards the agent for reaching a
goal state.
Each Task has a Starter and Ender (more on those later) which determine
how episodes start and end respectively. The Task computes rewards for
state, action, next state transitions and determines the goal states
that the agent should reach. The Task interface is:
```go
type Task interface {
	Starter
	Ender
	GetReward(state mat.Vector, a mat.Vector, nextState mat.Vector) float64
	AtGoal(state mat.Matrix) bool
	Min() float64 // returns the min possible reward
	Max() float64 // returns the max possible reward
	RewardSpec() spec.Environment
}
```

All Tasks have in-depth documentation in the Go source files. These
can be viewed with the go doc command.
New Tasks can easily be implemented and added to existing environments
to change what the agent should accomplish in a given environment. This
kind of modularity makes it very easy to have a single environment with
different goals. For example, the Cartpole environment has a Balance
Task, but one could easily create a new Task, e.g. SwingUp, and
pass this task into the Cartpole constructor to create a totally new
environment that is not yet implemented by this module.
Each Environment can use any struct implementing the Task interface,
but commonly used Tasks for each Environment are implemented in the
Environment's respective package.
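As a hedged sketch of what such a new Task might look like, the following implements the Task interface with illustrative reward logic. The state indexing, thresholds, and embedded Starter/Ender are assumptions, not the library's actual SwingUp task:

```go
// SwingUpSketch rewards the agent for keeping the pole near upright.
// It embeds a Starter and an Ender to satisfy those parts of the
// Task interface.
type SwingUpSketch struct {
	environment.Starter // determines episode starting states
	environment.Ender   // determines episode end conditions
}

// GetReward gives +1 whenever the pole angle (assumed here to be the
// first state feature) is within 0.1 radians of upright, else 0.
func (s SwingUpSketch) GetReward(state, a, nextState mat.Vector) float64 {
	if math.Abs(nextState.AtVec(0)) < 0.1 {
		return 1.0
	}
	return 0.0
}

// AtGoal reports whether the pole is near upright.
func (s SwingUpSketch) AtGoal(state mat.Matrix) bool {
	return math.Abs(state.At(0, 0)) < 0.1
}

func (s SwingUpSketch) Min() float64 { return 0.0 } // min possible reward
func (s SwingUpSketch) Max() float64 { return 1.0 } // max possible reward

func (s SwingUpSketch) RewardSpec() spec.Environment {
	return spec.Environment{} // fill in per the spec package's conventions
}
```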
The Starter interface determines how a Task starts in an Environment:
```go
type Starter interface {
	Start() mat.Vector
}
```

An environment will call the Start() method of its task, which will
return a starting state. The agent will then start from this state
to complete its task. If a Starter produces a starting state that is
not possible, then the environment will panic.
The main Starter for continuous-state environments is the UniformStarter
which selects starting states drawn from a multivariate uniform distribution,
where each dimension of the multivariate distribution is given bounds of
selection. For example:
```go
var seed uint64 = 1
bounds := []r1.Interval{{Min: -0.6, Max: 0.2}, {Min: -0.02, Max: 0.02}}
starter := environment.NewUniformStarter(bounds, seed)
```

will create a UniformStarter which samples starting states [x, y]
uniformly, where x ∈ [-0.6, 0.2] and y ∈ [-0.02, 0.02].
Tasks can take in Starter structs to determine how and where the
Task will begin for each episode.
The Ender interface determines how a Task ends in an Environment:
```go
type Ender interface {
	End(*timestep.TimeStep) bool
}
```

The End() method takes in a pointer to a TimeStep. The function checks
whether this TimeStep is the last in the episode. If so, the function
first changes the TimeStep.StepType field to timestep.Last and returns
true. If not, the function leaves the TimeStep.StepType field as
timestep.Mid and returns false.
The environment package implements global Enders which can be used
with any Environment. Sub-packages of the environment package (such
as environment/classiccontrol/mountaincar) implement Enders which
may be used for Environments within that package. The global
Enders are:
- StepLimit: Ends episodes at a specific timestep limit.
- IntervalLimit: Ends episodes when a state feature leaves an interval.
- FunctionEnder: Ends an episode when a function or method returns true.
Environment wrappers can be found in the environment/wrappers package.
Environment wrappers are themselves Environments, but they somehow alter
the underlying Environment that they wrap. For example, a SinglePrecision
might wrap an Environment and convert all observations to float32.
A NoiseMaker might inject noise into the environmental observations. A
TileCoder might return tile-coded representations of environmental states.
So far, the following environment wrappers are implemented:
- TileCoding: Tile codes environmental observations
- IndexTileCoding: Tile codes environmental observations and returns as state observations the indices of non-zero components in the tile-coded vector.
- AverageReward: Converts an environment to the average reward formulation, returning the differential reward at each timestep and tracking/updating the policy's average reward estimate over time. This wrapper easily converts any algorithm to its differential counterpart.
It is easy to implement your own environment wrapper. All you need to do
is create a struct that stores another Environment and have your
wrapper implement the Environment interface:
```go
type Environment interface {
	Task
	Reset() timestep.TimeStep
	Step(action mat.Vector) (timestep.TimeStep, bool)
	DiscountSpec() spec.Environment
	ObservationSpec() spec.Environment
	ActionSpec() spec.Environment
}
```

With struct embedding this is even easier. Simply embed an Environment in
your Environment wrapper, and then "override" the necessary methods. Usually,
only the Reset(), Step(), and ObservationSpec() methods need to
be overridden if you embed an Environment in your wrapper.
Environment wrappers follow the decorator design pattern to ensure
easy extension and modification of Environments.
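For example, a minimal wrapper skeleton might look like this. The observation-modifying logic is left as a comment since TimeStep's internals are not shown here, and this is an illustrative skeleton, not the library's NoiseMaker:

```go
// NoiseSketch wraps an Environment and would inject noise into its
// observations before the agent sees them.
type NoiseSketch struct {
	environment.Environment // embed: Task and the Spec methods come for free
}

// Step "overrides" the embedded Environment's Step method.
func (n *NoiseSketch) Step(action mat.Vector) (timestep.TimeStep, bool) {
	step, last := n.Environment.Step(action)
	// ... modify step's observation here before returning it ...
	return step, last
}

// Reset would be overridden the same way to modify the first
// observation of each episode.
```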
A TileCoding is an Environment wrapper, itself being an Environment.
The TileCoding struct will tile-code all observations before passing
them to an Agent. The TileCoding struct has the constructor:
```go
func NewTileCoding(env environment.Environment, bins [][]int, seed uint64) (*TileCoding, ts.TimeStep)
```

The env parameter is an Environment to wrap. The bins parameter
determines both the number of tilings to use and the bins per each
dimension, per tiling. The length of the outer slice is the number of
tilings for the TileCoding Environment to use. The sub-slices determine
the number of tiles per dimension to use in each respective tiling.
For example, if we had:
```go
bins := [][]int{{2, 3}, {16, 21}, {5, 6}}
```

Then the TileCoding Environment would tile code all environmental
observations using 3 tilings before passing the observations back to
the Agent. The first tiling has shape 2x3 tiles. The second tiling
will have shape 16x21 tiles. The third tiling will have 5 tiles
along the first dimension and 6 along the second dimension. Note that
the length of each sub-slice should equal the length of the
environmental observation vector. In the example above, we would expect
the embedded Environment to return 2D vector observations.
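Putting this together, wrapping an environment could look like the following sketch (assuming the constructor lives in the environment/wrappers package described above):

```go
// Tile code a 2-D observation space with 3 tilings of different shapes.
bins := [][]int{{2, 3}, {16, 21}, {5, 6}}
tileCoded, firstStep := wrappers.NewTileCoding(env, bins, seed)
// tileCoded is itself an Environment; train the agent on it directly.
```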
Tile coding requires bounding the ranges of each feature dimension. The
TileCoding Environment takes care of this itself, and creates the tilings
automatically over these portions of state feature space.
Tile coding also requires that tilings be offset from one another. Each
tiling is offset from the origin by uniformly randomly sampling an
offset between [-tileWidth / OffsetDiv, tileWidth / OffsetDiv] for
each feature dimension, where tilecoder.OffsetDiv is a global constant
that determines the degree to which tiles are offset from the origin.
Note that tilings are offset for each state feature dimension, meaning
that the tiling needs to be offset in each dimension that is tile coded.
If state features are n-D, then the offset of the tilings is also n-D.
The implementation of tile coding in this library makes use of concurrency.
The tile coded vector is constructed by calculating the non-zero indices
generated by each tiling concurrently and then setting each of these
indices sequentially in a separate goroutine. A separate goroutine is
used to set the indices so that the indices can be set as soon as one of
the concurrent goroutines for calculating an index is finished.
This greatly reduces the computation time when there are many tilings and
many tiles per tiling.
The IndexTileCoding wrapper is conceptually identical to the TileCoding
wrapper except that the wrapper returns as state observations the indices
of non-zero components in the tile-coded original state observation. For
example, if a state observation is tile coded and the tile-coded representation
is [1 0 0 1 0 0 0 0 0 0 1 0 1], then the state feature vector returned by
IndexTileCoding is [3 10 12 0] in no particular order, except that the
bias unit will always be the last index if a bias unit is used.
Trackers define what data an Experiment will save to disk. For example,
the trackers.Return Tracker will cause an Experiment to save the episodic
returns seen during the Experiment. The trackers.EpisodeLength Tracker
will cause an Experiment to save the episode lengths during an
Experiment. Before running an Experiment, Trackers can be registered
with the Experiment by passing the required Trackers to the Experiment
constructor. Additionally, the Register() method can be used to
register a Tracker with an Experiment after the Experiment has already
been constructed.
Any struct that implements the Tracker interface can be passed to an
Experiment. For each timestep, the Experiment
will send the latest TimeStep returned from the Environment to
each of the Trackers registered with the Experiment using the Tracker's
Track() method. This will cache the needed data in each Tracker. Once
the Experiment is done, the Save() method can then be called on each
Tracker to save the recorded experimental data to disk. Alternatively, each
Experiment implements a Save() method which will automatically save
all data for each of the Trackers registered with the Experiment. This
is perhaps the easiest way to save experimental data.
The Tracker interface
is:
```go
type Tracker interface {
	Track(t ts.TimeStep)
	Save()
}
```

Trackers follow the observer-observable design pattern to track data
generated from an experiment and save it later.
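For example, a custom Tracker that records the reward on every timestep might be sketched as follows (the TimeStep field name and the saving logic are assumptions):

```go
// RewardTracker caches every reward seen during an experiment and
// writes them out when Save is called.
type RewardTracker struct {
	rewards []float64
}

// Track caches the reward of the latest TimeStep.
func (r *RewardTracker) Track(t ts.TimeStep) {
	r.rewards = append(r.rewards, t.Reward) // field name assumed
}

// Save writes the cached rewards to disk (serialization left out).
func (r *RewardTracker) Save() {
	// e.g. encode r.rewards to a file of your choosing
}
```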
If an Environment is wrapped in such a way that the TimeStep returned
contains modified data, but the unmodified data is desired to be saved,
the underlying, wrapped Environment can be registered with a Tracker
using the trackers.Register() method. In this way, the data from the
underlying, wrapped Environment will be tracked instead of the data
from the wrapper Environment. For example, if an environment is wrapped
in a wrappers.AverageReward wrapper, then the differential reward is
returned on each timestep. In many cases, we desire to track the return of an
episode, not the differential return. In this case, the underlying, wrapped
Environment can be registered with a Tracker using the
trackers.Register() method. In this way, the underlying, un-modified reward
and the episodic return (rather than the episodic differential return) can be
tracked.
A Checkpointer checkpoints an Agent during an experiment by saving the
Agent to a binary file. The Checkpointer interface is:
```go
type Checkpointer interface {
	Checkpoint(ts.TimeStep) error
}
```

Checkpointers can only work with Serializable structs. A struct is
serializable if it implements the Serializable interface:

```go
type Serializable interface {
	gob.GobEncoder
	gob.GobDecoder
}
```

Currently, the only implemented Checkpointer is an nStep Checkpointer
which checkpoints an Agent every n steps of an agent-environment
interaction. For more information, see the checkpointer package.
Currently, no agents implement the Serializable interface. This will
be added on an as-needed basis.
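For reference, a struct can satisfy Serializable with the standard encoding/gob package like the following sketch (the MyAgent type and its Weights field are illustrative):

```go
// GobEncode serializes the agent's learned weights with encoding/gob.
func (a *MyAgent) GobEncode() ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(a.Weights); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// GobDecode restores the agent's weights from a gob-encoded payload.
func (a *MyAgent) GobDecode(data []byte) error {
	return gob.NewDecoder(bytes.NewReader(data)).Decode(&a.Weights)
}
```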
An experiment.Config outlines what kind of Experiment should be run
and with which Environment and which Agent with which hyperparameters
(defined by the Agent's ConfigList). The Experiment Config holds
an Environment Config and an Agent ConfigList (actually, it
holds a TypedConfigList so that it knows how to JSON unmarshal the
ConfigList). The Experiment Config also has a Type which outlines
what Type of experiment we are running (e.g. an OnlineExp is a
Type of Experiment that runs the Agent online and performs online
evaluation only):
```go
// Config represents a configuration of an experiment.
type Config struct {
	Type
	MaxSteps  uint
	EnvConf   envconfig.Config
	AgentConf agent.TypedConfigList
}
```

To create and run an Experiment, you can use the specific Experiment's
constructor. For example, if we wanted an Online experiment:
```go
agent := ...         // Create agent
env := ...           // Create environment
maxSteps := ...      // Maximum number of steps for the experiment
trackers := ...      // Create trackers for the experiment
checkpointers := ... // Create checkpointers for the experiment

exp := NewOnline(env, agent, maxSteps, trackers, checkpointers)
exp.Run()  // Run the experiment
exp.Save() // Save the data generated
```

Or you can use an Experiment Config (these can be JSON serialized
to store and run later):
```go
agentConfig := ...   // Create agent configuration list
envConfig := ...     // Create environment configuration
maxSteps := ...      // Maximum number of steps for the experiment
trackers := ...      // Create trackers for the experiment
checkpointers := ... // Create checkpointers for the experiment

expConfig := experiment.Config{
	Type:      experiment.OnlineExp,
	MaxSteps:  maxSteps, // e.g. 1_000_000
	EnvConf:   envConfig,
	AgentConf: agentConfig,
}
// JSON marshal the expConfig so that we can run it as many times as we
// want to later

var agentIndex int = ... // The index of the agent in the ConfigList to run
var seed uint64 = ...    // Seed for the agent and environment

exp := expConfig.CreateExperiment(agentIndex, seed, trackers, checkpointers)
exp.Run()  // Run the experiment
exp.Save() // Save the data generated
```

To run the program, simply run the main.go file. The program takes two
command-line arguments: a JSON configuration file for an Experiment and
a hyperparameter setting index for the Agent defined in the configuration
file. Example JSON Experiment configuration files are given in the
experiments directory.
An Experiment configuration file describes an Environment for the
Experiment as well as an Agent to run on the environment. For each
possible hyperparameter of the agent, the configuration file lists all
possible values that the user would like to test for that hyperparameter.
When running the program, the user must then specify a hyperparameter index
to use for the Experiment. The program will use that hyperparameter setting
index to get only the relevant hyperparameters from the configuration file,
construct an Agent using those hyperparameters, and then run the Experiment.
Hyperparameter setting indices wrap around: if there are 10 hyperparameter
settings, then indices 0-9 determine each hyperparameter setting for the
first run of the Experiment, and indices 10-19 determine each
hyperparameter setting for the second run. In general, indices 10n to
10(n+1) - 1 determine the hyperparameter settings for run n+1 of the
Experiment, and indices 10n + m (for fixed m) refer to sequential runs
of hyperparameter setting m of the Agent in the Experiment.
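The arithmetic behind this wrap-around is simple. A sketch, assuming 10 hyperparameter settings:

```go
numSettings := 10 // total hyperparameter combinations in the ConfigList
index := 23       // command-line hyperparameter setting index

setting := index % numSettings // setting 3 of the ConfigList
run := index / numSettings     // run index 2, i.e. the third run of setting 3
```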
- Eventually, it would be nice to have environments and tasks JSON serializable in the same manner as Solvers and InitWFns. This would make the config files super configurable... Instead of using default environment values all the time, we could have configurable environments through the JSON config files.
- Cartpole SwingUp would be nice to implement.
- Would be nice to have the actions in discrete Pendulum determined by min and max discrete actions. E.g. action i -> (action i / minDiscreteAction) - MinContinuousAction, and similarly for max actions. Then, (MaxDiscreteAction * MinDiscreteAction) / 2 would be the 0 (do nothing) action, which is the middle action.
- Readme should mention that all configurations in a ConfigList should be compatible. E.g. if you have 3 hidden layers, then you must have 3 activations, etc.
- Task AtGoal() -> argument should be Vector or *VecDense.
- Add gym to the EnvConfig structs.
- Add TimeLimit to the gym package so that time limits can be altered.
- All input nodes should have unique names. Use gop.Unique().
=== Most Pressing ===
- VAC still gets NaNs. The problem could be with multi-dimensional actions in GaussianTreeMLP. Does VPG also get NaNs?
- I believe the issue with VAC and VPG is that the Gaussian policy does not clamp the standard deviation. Make sure this is clamped!
- Use GoTile for the tile coder.
- The tracker package could be reworked. Instead, provide a bunch of different hooks, which are called at set times. For example, there would be a set of hooks called before the experiment is run, before each episode, before each timestep, after each timestep, after each episode, and after the experiment is done. In such a case, we will need an interface for:
  - Pre/post-experiment hooks, which are just functions that take in the agent and environment
  - Pre/post-episode hooks, which take in the agent, environment, and first step
  - Pre/post-timestep hooks, which take in a timestep
  - Then, the agent and environment interfaces should be extended such that they have an Info method, which will return info tracked by the agent or environment. The hooks can then use these.
  - Add hook wrappers that will call the hook only every N episodes or N steps
- Discount factor should be returned by the environment on each timestep, and adapted by the task such that timeouts do not cause episode termination. That is, if an episode ends by the goal being reached, a discount factor of 0 is returned. Otherwise, a discount factor of γ is returned.