StG999/Deep-QLearning-Lunar-Lander


The Lunar Lander Environment

In this project, I've used OpenAI's Gym library, which provides a wide variety of environments for reinforcement learning. To put it simply, an environment represents a problem or task to be solved. Here, I have solved the Lunar Lander environment using reinforcement learning.

The goal of the Lunar Lander environment is to land the lunar lander safely on the landing pad on the surface of the moon. The landing pad is marked by two flag poles and is always at coordinates (0,0), but the lander is also allowed to land outside of the pad. The lander starts at the top center of the environment with a random initial force applied to its center of mass and has infinite fuel. The environment is considered solved when the agent scores 200 points.



*Figure: the Lunar Lander environment.*

Action Space

The agent has four discrete actions available:

  • Do nothing.
  • Fire right engine.
  • Fire main engine.
  • Fire left engine.

Each action has a corresponding numerical value:

Do nothing = 0
Fire right engine = 1
Fire main engine = 2
Fire left engine = 3
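
As a quick sanity check, the environment and its action space can be inspected directly. This is a minimal sketch assuming the classic Gym API and the `LunarLander-v2` environment ID (which requires the Box2D extras); it is not taken from this repository's code.

```python
import gym

# LunarLander-v2 requires the Box2D extras: pip install "gym[box2d]"
env = gym.make('LunarLander-v2')

print(env.action_space)           # Discrete(4)
print(env.action_space.n)         # 4 -> 0: do nothing, 1: fire right, 2: fire main, 3: fire left
print(env.action_space.sample())  # a random action, handy for exploration
```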

Observation Space

The agent's observation space consists of a state vector with 8 variables:

  • Its $(x,y)$ coordinates. The landing pad is always at coordinates $(0,0)$.
  • Its linear velocities $(\dot x,\dot y)$.
  • Its angle $\theta$.
  • Its angular velocity $\dot \theta$.
  • Two booleans, $l$ and $r$, that represent whether each leg is in contact with the ground or not.
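
To see this 8-variable state vector, reset the environment and inspect the returned array. Again, this is a sketch assuming the classic Gym API (gym < 0.26), where `reset()` returns just the observation.

```python
import gym
import numpy as np

env = gym.make('LunarLander-v2')

# With the classic Gym API, reset() returns the initial observation directly.
state = env.reset()

print(env.observation_space)  # an 8-dimensional Box space
print(state.shape)            # (8,)
# state = [x, y, x_dot, y_dot, theta, theta_dot, left_leg_contact, right_leg_contact]
print(np.round(state, 3))
```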

Rewards

The Lunar Lander environment has the following reward system:

  • Landing on the landing pad and coming to rest is worth about 100-140 points.
  • If the lander moves away from the landing pad, it loses reward.
  • If the lander crashes, it receives -100 points.
  • If the lander comes to rest, it receives +100 points.
  • Each leg with ground contact is +10 points.
  • Firing the main engine is -0.3 points each frame.
  • Firing the side engine is -0.03 points each frame.

Episode Termination

An episode ends (i.e., the environment enters a terminal state) if:

  • The lunar lander crashes (i.e., the body of the lunar lander comes into contact with the surface of the moon).

  • The lander's $x$-coordinate is greater than 1 (i.e., it drifts outside the viewport).
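
Putting these pieces together, a single episode runs until one of the terminal conditions above sets the `done` flag. A minimal random-agent loop, under the same classic-Gym assumptions as the sketches above:

```python
import gym

env = gym.make('LunarLander-v2')
state = env.reset()
total_reward, done = 0.0, False

# Step with random actions until the episode terminates
# (crash, coming to rest, or drifting out of bounds).
while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    total_reward += reward

print(f"Episode return: {total_reward:.1f}")
```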

Deep Q-Learning

Deep $Q$-Learning is used as the backbone of the model.

In cases where both the state and action space are discrete, we can estimate the action-value function iteratively using the Bellman equation:

$$ Q_{i+1}(s,a) = R + \gamma \max_{a'}Q_i(s',a') $$

This iterative method converges to the optimal action-value function $Q^*(s,a)$ as $i\to\infty$. This means that the agent just needs to gradually explore the state-action space and keep updating the estimate of $Q(s,a)$ until it converges to the optimal action-value function $Q^*(s,a)$. However, in cases where the state space is continuous, it becomes practically impossible to explore the entire state-action space. Consequently, this also makes it practically impossible to gradually estimate $Q(s,a)$ until it converges to $Q^*(s,a)$.
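
When the state space really is discrete, the update above can be implemented with a simple lookup table. The sketch below uses hypothetical table sizes and hyperparameters purely for illustration; the Lunar Lander's continuous state vector is exactly what rules this approach out.

```python
import numpy as np

n_states, n_actions = 16, 4   # hypothetical discrete sizes, for illustration only
gamma, alpha = 0.995, 0.1     # discount factor and step size (assumed values)
Q = np.zeros((n_states, n_actions))

def bellman_update(s, a, r, s_next, done):
    # Move Q(s, a) toward the Bellman target R + gamma * max_a' Q(s', a').
    target = r + (1.0 - done) * gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```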

In Deep $Q$-Learning, we solve this problem by using a neural network to estimate the action-value function, $Q(s,a)\approx Q^*(s,a)$. We call this neural network a $Q$-Network, and it can be trained by adjusting its weights at each iteration to minimize the mean-squared error in the Bellman equation.
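
One way to set up such a $Q$-Network is a small fully connected network that maps the 8-dimensional state to one $Q$-value per action. The layer sizes and optimizer below are assumptions for illustration, not necessarily what this repository uses; the sketch is in TensorFlow/Keras.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.optimizers import Adam

STATE_SIZE, NUM_ACTIONS = 8, 4   # Lunar Lander: 8 state variables, 4 discrete actions

# Maps a state vector to one Q-value estimate per action.
q_network = Sequential([
    Input(shape=(STATE_SIZE,)),
    Dense(64, activation='relu'),
    Dense(64, activation='relu'),
    Dense(NUM_ACTIONS, activation='linear'),
])
q_network.compile(optimizer=Adam(learning_rate=1e-3), loss=MeanSquaredError())
```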

Unfortunately, using neural networks in reinforcement learning to estimate action-value functions has proven to be highly unstable. Luckily, there are a couple of techniques that can be employed to avoid instabilities: using a Target Network and Experience Replay.
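
A rough sketch of both techniques (the buffer size, batch size, and soft-update rate `tau` are assumed values, not taken from this repository): Experience Replay stores past transitions and samples random mini-batches to break correlations between consecutive updates, while the Target Network is a slowly updated copy of the $Q$-Network used to compute the Bellman targets.

```python
import random
from collections import deque, namedtuple

import numpy as np

Experience = namedtuple('Experience', ['state', 'action', 'reward', 'next_state', 'done'])

# Experience replay buffer: store transitions, then sample uncorrelated mini-batches.
replay_buffer = deque(maxlen=100_000)

def sample_batch(batch_size=64):
    batch = random.sample(replay_buffer, batch_size)
    states = np.array([e.state for e in batch], dtype=np.float32)
    actions = np.array([e.action for e in batch])
    rewards = np.array([e.reward for e in batch], dtype=np.float32)
    next_states = np.array([e.next_state for e in batch], dtype=np.float32)
    dones = np.array([e.done for e in batch], dtype=np.float32)
    return states, actions, rewards, next_states, dones

# Target network update: blend the Q-network weights into the target network
# with a soft-update factor tau instead of copying them all at once.
def soft_update(q_network, target_q_network, tau=1e-3):
    for target_w, q_w in zip(target_q_network.weights, q_network.weights):
        target_w.assign(tau * q_w + (1.0 - tau) * target_w)
```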
