In this project, I've used OpenAI's Gym library, which provides a wide variety of environments for reinforcement learning. To put it simply, an environment represents a problem or task to be solved. Here, I solve the Lunar Lander environment using reinforcement learning.
The goal of the Lunar Lander environment is to land the lunar lander safely on the landing pad on the surface of the moon. The landing pad is designated by two flag poles and is always at coordinates $(0, 0)$, but the lander is also allowed to land outside of the landing pad. The lander starts at the top center of the environment with a random initial force applied to its center of mass and has infinite fuel. The environment is considered solved if the agent scores 200 points.
Lunar Lander Environment.
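As a minimal sketch of how the environment is set up with Gym (the `LunarLander-v2` id and the classic `reset` signature are assumptions about the installed Gym version):

```python
import gym

# Create the Lunar Lander environment. The id 'LunarLander-v2' is an
# assumption about the installed Gym version (newer releases use 'LunarLander-v3').
env = gym.make('LunarLander-v2')

# Classic Gym API: reset() returns the initial 8-dimensional state vector.
# (Newer Gym/Gymnasium versions return a (state, info) tuple instead.)
initial_state = env.reset()
print(initial_state)
```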
The agent has four discrete actions available:
- Do nothing.
- Fire right engine.
- Fire main engine.
- Fire left engine.
Each action has a corresponding numerical value:
- Do nothing = 0
- Fire right engine = 1
- Fire main engine = 2
- Fire left engine = 3

The agent's observation space consists of a state vector with 8 variables (see the code check after this list):
- Its $(x, y)$ coordinates. The landing pad is always at coordinates $(0, 0)$.
- Its linear velocities $(\dot x, \dot y)$.
- Its angle $\theta$.
- Its angular velocity $\dot \theta$.
- Two booleans, $l$ and $r$, that represent whether each leg is in contact with the ground or not.
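These actions and state variables correspond to the environment's `action_space` and `observation_space` attributes, which can be checked directly (a small sketch, again assuming the `LunarLander-v2` environment id):

```python
import gym

env = gym.make('LunarLander-v2')

# Four discrete actions: 0 = do nothing, 1 = fire right engine,
# 2 = fire main engine, 3 = fire left engine.
print(env.action_space)       # Discrete(4)

# 8-dimensional state: (x, y, x_dot, y_dot, theta, theta_dot, l, r).
print(env.observation_space)  # Box with shape (8,)
```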
The Lunar Lander environment has the following reward system:
- Landing on the landing pad and coming to rest is about 100-140 points.
- If the lander moves away from the landing pad, it loses reward.
- If the lander crashes, it receives -100 points.
- If the lander comes to rest, it receives +100 points.
- Each leg with ground contact is +10 points.
- Firing the main engine is -0.3 points each frame.
- Firing the side engine is -0.03 points each frame.
An episode ends (i.e., the environment enters a terminal state) if:
- The lunar lander crashes (i.e., the body of the lunar lander comes in contact with the surface of the moon).
- The lander's $x$-coordinate is greater than 1.
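To see the reward signal and the terminal conditions in practice, a random agent can be rolled out for one episode and its return accumulated. This is a minimal sketch assuming the classic Gym `step` API, which returns `(state, reward, done, info)`; newer Gymnasium versions split `done` into `terminated` and `truncated`:

```python
import gym

env = gym.make('LunarLander-v2')
state = env.reset()

total_reward = 0.0
done = False
while not done:
    # A random policy: sample one of the four discrete actions.
    action = env.action_space.sample()
    # 'done' becomes True when the lander crashes, comes to rest,
    # or drifts out of bounds.
    state, reward, done, info = env.step(action)
    total_reward += reward

# A random agent typically scores well below the +200 "solved" threshold.
print(f"Episode return: {total_reward:.1f}")
```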
In cases where both the state and action space are discrete, we can estimate the action-value function iteratively by using the Bellman equation:

$$Q_{i+1}(s,a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q_i(s',a') \,\right]$$

where $r$ is the reward, $\gamma$ is the discount factor, and $s'$ is the next state.
This iterative method converges to the optimal action-value function $Q^*(s,a)$ as $i\to\infty$. This means that the agent just needs to gradually explore the state-action space and keep updating the estimate of $Q(s,a)$ until it converges to the optimal action-value function $Q^*(s,a)$. However, in cases where the state space is continuous, it becomes practically impossible to explore the entire state-action space. Consequently, this also makes it practically impossible to gradually estimate $Q(s,a)$ until it converges to $Q^*(s,a)$.
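For intuition, here is a minimal sketch of that iterative estimate in the tabular case; the state count, learning rate, and discount factor are hypothetical values. This only works when the states can be enumerated, which is exactly what breaks down with Lunar Lander's continuous state vector:

```python
import numpy as np

n_states, n_actions = 16, 4   # hypothetical small discrete problem
alpha, gamma = 0.1, 0.99      # assumed learning rate and discount factor

Q = np.zeros((n_states, n_actions))

def bellman_update(s, a, r, s_next):
    """One iterative Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```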
In the Deep Q-Learning algorithm, this problem is addressed by using a neural network to approximate the action-value function, $Q(s,a) \approx Q^*(s,a)$; this network is commonly referred to as the Q-Network.
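As a sketch of what such a function approximator could look like (PyTorch and the hidden-layer sizes are assumptions here, not this project's actual architecture), the network simply maps the 8-dimensional state to one Q-value per action:

```python
import torch
import torch.nn as nn

# Maps the 8-dimensional state vector to one Q-value per discrete action.
# Hidden sizes are illustrative assumptions, not this project's architecture.
q_network = nn.Sequential(
    nn.Linear(8, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 4),
)

state = torch.randn(1, 8)       # a dummy state
q_values = q_network(state)     # shape (1, 4): one value per action
greedy_action = q_values.argmax(dim=1)
```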
Unfortunately, using neural networks in reinforcement learning to estimate action-value functions has proven to be highly unstable. Luckily, there are a couple of techniques that can be employed to avoid instabilities: using a Target Network and Experience Replay.
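A rough sketch of those two techniques is given below; PyTorch, the layer sizes, and all hyperparameters (buffer size, batch size, soft-update rate) are assumptions for illustration rather than this project's actual settings. The replay buffer stores transitions and serves random minibatches, while the target network is a lagged copy of the Q-network that is synced slowly toward it:

```python
import random
from collections import deque, namedtuple

import torch
import torch.nn as nn

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

# Experience Replay: store past transitions and sample uncorrelated minibatches
# from them, instead of learning from consecutive (highly correlated) steps.
replay_buffer = deque(maxlen=100_000)   # buffer size is an assumed hyperparameter

def store(state, action, reward, next_state, done):
    replay_buffer.append(Transition(state, action, reward, next_state, done))

def sample_minibatch(batch_size=64):    # batch size is an assumed hyperparameter
    return random.sample(replay_buffer, batch_size)

# Target Network: a lagged copy of the Q-network used to compute the TD targets,
# so the targets do not shift at every single gradient step.
def make_network():
    return nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))

q_network = make_network()
target_network = make_network()
target_network.load_state_dict(q_network.state_dict())   # hard sync at the start

TAU = 1e-3   # assumed soft-update rate

def soft_update():
    """Slowly blend the Q-network's weights into the target network."""
    with torch.no_grad():
        for target_param, param in zip(target_network.parameters(),
                                       q_network.parameters()):
            target_param.mul_(1.0 - TAU).add_(TAU * param)
```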