Reinforcement Learning
G.Prethija, SCOPE, VIT-Chennai
Supervised vs Unsupervised vs Reinforcement Learning
Reinforcement Learning
• Reinforcement Learning (RL) is a machine learning approach inspired by behaviorist
psychology and, in particular, the way humans and animals learn to make decisions via (positive or
negative) rewards received from their environment.
• An agent learns to make decisions by interacting with an environment. The agent takes actions,
observes the results, and receives feedback in the form of rewards or penalties. Over time, the agent
aims to maximize its cumulative reward by learning an optimal strategy or policy.
• RL is sometimes likened to semi-supervised learning: the reward acts as a time-delayed label, and such labels are rare.
• Reinforcement Learning is a family of algorithms and techniques used for control (e.g., robotics,
autonomous driving) and decision making.
Reinforcement Learning-Applications
Reinforcement Learning
• Agent: The learner or decision-maker that takes actions (e.g., a robot, game character).
• Environment: The external system the agent interacts with (e.g., game world, real world).
• State: A representation of the current situation the agent is in, based on the environment (e.g., player position).
• Action: Choices the agent can make at any given time (e.g., move left, right, jump).
• Reward: Feedback from the environment based on the action taken, which can be positive (reward) or negative (penalty).
• Policy: A strategy the agent follows to decide which actions to take in different states.
• Value Function: A measure of the expected long-term reward for a state or a state-action pair.
• Q-value: The expected future reward for taking a specific action in a given state; used in algorithms like Q-learning.
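A minimal sketch of how these pieces fit together; the reset()/step() interface loosely follows the Gym convention, and all names and values here are illustrative assumptions, not a specific library's API:

import random

# Toy Environment: a single state, episode ends after 10 steps.
class ToyEnv:
    def reset(self):
        self.t = 0
        return 0                      # the initial State

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == "good" else -1.0   # Reward or penalty
        done = self.t >= 10
        return 0, reward, done        # next State, Reward, episode over?

# A trivial random Policy; a learned policy would replace this.
def random_policy(state, actions):
    return random.choice(actions)

# The agent-environment loop: the Agent picks an Action, the Environment
# returns a Reward and the next State; repeat until the episode ends.
def run_episode(env, actions, policy):
    state, total, done = env.reset(), 0.0, False
    while not done:
        action = policy(state, actions)
        state, reward, done = env.step(action)
        total += reward               # accumulate the cumulative reward
    return total

print(run_episode(ToyEnv(), ["good", "bad"], random_policy))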
Reinforcement Learning-Use cases
• Robot Ball-In-A-Cup: https://www.youtube.com/watch?v=qtqubguikMk
• Reinforcement Learning for Robot Navigation: https://www.youtube.com/watch?v=b2PxUslKZm4
• Unitree Go2 & B2 robotic dog: https://www.youtube.com/watch?v=g6NfGuV0IVE
Reinforcement Learning-Use cases
• State: the mouse's position (grid cell)
• Action: move up, down, left, or right
• Reward: positive or negative
• The mouse may only get the cheese at the end of the maze, so the reward is sparse (rare).
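A minimal sketch of such a gridworld with a sparse reward; the corridor layout and reward values are illustrative assumptions:

import random

# 1-D corridor gridworld: the mouse starts in cell 0, the cheese sits in
# the last cell. Reward is sparse: 0 everywhere except +10 at the cheese.
N_CELLS = 5
ACTIONS = [-1, +1]        # move left, move right

def step(cell, action):
    nxt = min(max(cell + action, 0), N_CELLS - 1)   # stay inside the grid
    reward = 10.0 if nxt == N_CELLS - 1 else 0.0    # sparse reward
    done = nxt == N_CELLS - 1
    return nxt, reward, done

# Random walk until the cheese is found; shows how rarely reward appears.
cell, steps, done = 0, 0, False
while not done:
    cell, reward, done = step(cell, random.choice(ACTIONS))
    steps += 1
print(f"reached cheese after {steps} steps; only the last step gave reward")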
Reinforcement Learning
• Design a policy specifying which action to take in each state s so as to maximize the chance of receiving future rewards.
• The environment is probabilistic, therefore the policy may also be probabilistic.
Reinforcement Learning
How much reward will I get in the future?
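Formally, this quantity is the discounted return: the sum of future rewards, where a discount factor \gamma \in [0, 1) weights nearer rewards more heavily:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}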
Reinforcement Learning-How to train AI to Play the Snake Game
On the left, the AI knows nothing about the game. On the right, the AI has been trained and has learned how to play.
Reinforcement Learning-How to train AI to Play the Snake Game
• a set of states S (an index based on the Snake's position)
• a set of actions A (Up, Down, Right, Left)
• a reward function R (+10 when the Snake eats an apple, -10 when the Snake hits a wall)
• an environment (our game)
• an agent (our Snake, i.e., the Deep Neural Network that drives the Snake's actions)
Every time the agent performs an action, the environment gives a reward to the agent, which can be
positive or negative depending on how good the action was from that specific state.
The goal of the agent is to learn what actions maximize the reward, given every possible state.
States are the observations that the agent receives at each iteration from the environment. A state can
be its position, its speed, or whatever array of variables describes the environment.
To be more rigorous, in Reinforcement Learning notation the strategy the agent uses to
make decisions is called a policy. A sketch of the reward function described above follows.
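A minimal sketch of the reward function R for the Snake game; ate_apple and hit_wall are hypothetical flags that the game loop would compute each step, not the article's actual code:

# Reward function R for the Snake game, as described above.
def reward(ate_apple: bool, hit_wall: bool) -> float:
    if ate_apple:
        return +10.0   # eating an apple is rewarded
    if hit_wall:
        return -10.0   # crashing into a wall is penalized
    return 0.0         # all other moves give no reward

ACTIONS = ["Up", "Down", "Right", "Left"]   # the action set A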
Reinforcement Learning-How to train AI to Play the Snake Game
• To understand how the agent takes decisions, we need to know what a Q-Table is.
• A Q-table is a matrix that correlates the state of the agent with the possible actions the agent
can take. The values in the table are a measure of each action's expected cumulative reward
(often loosely described as the action's probability of success), and they are updated based
on the rewards the agent receives during training.
• An example of a greedy policy is a policy where the agent looks up the table and selects the action
that leads to the highest score.
This table is the policy of the agent that we mentioned before: it determines what actions should be taken from every state to maximize the expected reward.
Demerit: a Q-table only works for a finite (and reasonably small) state space.
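A minimal sketch of tabular Q-learning with a greedy lookup, using the toy corridor from earlier; the environment and hyperparameters are illustrative assumptions:

import random
from collections import defaultdict

# Tabular Q-learning on a tiny 5-cell corridor.
N, GOAL = 5, 4
ACTIONS = [-1, +1]                       # left, right
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1        # learning rate, discount, exploration

Q = defaultdict(float)                   # the Q-table: (state, action) -> value

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return s2, (10.0 if s2 == GOAL else 0.0), s2 == GOAL

def greedy(s):                           # greedy policy: highest-scoring action
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for _ in range(200):                     # training episodes
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS) if random.random() < EPS else greedy(s)
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        target = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS) * (not done)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

# Learned policy per cell: cells 0-3 should be +1, i.e., move right.
print([greedy(s) for s in range(N)])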
Reinforcement Learning-How to train AI to Play the Snake Game
Deep Q-Learning extends Q-Learning by replacing the table with a deep
neural network, a powerful representation of a parametrized function that
can generalize across states. The Q-values are updated according to the
Bellman equation:
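In its standard Q-learning form, with learning rate \alpha and discount factor \gamma, the update is:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]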
Reinforcement Learning-How to train AI to Play the Snake Game
Algorithm
• The game starts, and the Q-values are randomly initialized.
• The agent collects the current state s (the observation).
• The agent executes an action based on the collected state. The action can either be
random or returned by its neural network. During the first phase of training, the
system often chooses random actions to maximize exploration. Later on, the system
relies more and more on its neural network.
• When the AI chooses and performs the action, the environment gives a reward to
the agent. The agent then reaches the new state s' and updates its Q-value
according to the Bellman equation above. Also, for each move, it stores
the original state, the action, the state reached after performing that action, the reward
obtained, and whether the game ended or not. This data is later sampled to train the
neural network; this mechanism is called Replay Memory.
• These last two operations are repeated until a certain condition is met (a sketch of this loop follows).
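A minimal sketch of this loop in PyTorch; the environment interface, network size, and hyperparameters are illustrative assumptions, and a real Snake agent would use the game's actual state encoding:

import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 4              # illustrative sizes (Up/Down/Right/Left)
GAMMA, EPS, BATCH = 0.9, 0.1, 32

# The network that replaces the Q-table: state in, one Q-value per action out.
net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
memory = deque(maxlen=10_000)            # Replay Memory

def choose_action(state):
    if random.random() < EPS:            # exploration: random action
        return random.randrange(N_ACTIONS)
    with torch.no_grad():                # exploitation: ask the network
        return net(torch.tensor(state)).argmax().item()

def train_step():
    if len(memory) < BATCH:
        return
    batch = random.sample(memory, BATCH) # sample stored transitions
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    q = net(s.float()).gather(1, a.view(-1, 1)).squeeze(1)
    with torch.no_grad():                # Bellman target: r + gamma * max_a' Q(s',a')
        target = r.float() + GAMMA * net(s2.float()).max(1).values * (~done)
    loss = nn.functional.mse_loss(q, target)
    opt.zero_grad(); loss.backward(); opt.step()

# Training loop skeleton; `env` is a hypothetical game with the usual
# reset()/step() interface (not shown here).
def run(env, episodes=100):
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = choose_action(state)
            next_state, reward, done = env.step(action)
            memory.append((state, action, reward, next_state, done))
            train_step()                 # sample Replay Memory, fit the net
            state = next_state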
References
• Georgios N. Yannakakis and Julian Togelius, Artificial Intelligence and Games, Springer, January 26, 2018
• https://towardsdatascience.com/how-to-teach-an-ai-to-play-games-deep-reinforcement-learning-28f9b920440a
• https://www.youtube.com/watch?v=0MNVhXEX9to
• https://www.youtube.com/watch?v=AhyznRSDjw8