Q-LEARNING
CO-3
AIM
To familiarize students with the concepts of reinforcement learning and Q-learning, including
the Q-function, temporal difference learning, and the exploration/exploitation trade-off
INSTRUCTIONAL OBJECTIVES
This session is designed to:
1. Introduce the Q-function and its relation to the optimal policy and value function
2. Explain the Q-learning update rule and temporal difference (TD) learning
LEARNING OUTCOMES
At the end of this session, you should be able to:
1. Define the Q-function and derive the optimal policy from it
2. Apply the Q-learning update rule to observed state-action transitions
3. Explain the exploration/exploitation trade-off
4. Describe extensions such as experience replay, prioritized sweeping, and function approximation
Q-Function
• One approach to RL is then to try to estimate V*(s).
• However, this approach requires you to know r(s,a) and δ(s,a).
• This is unrealistic in many real problems. What is the reward if a robot is
exploring Mars and decides to take a right turn?
• Fortunately, we can circumvent this problem by exploring and
experiencing how the world reacts to our actions; we need to learn r and δ.
• We want a function that directly scores state-action pairs, i.e., tells us which
action to take in a given state. We call this Q(s,a).
• Given Q(s,a) it is now trivial to execute the optimal policy, without
knowing r(s,a) and δ(s,a). We have:
π*(s) = argmax_a Q(s,a)
V*(s) = max_a Q(s,a)
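A minimal sketch of how the optimal policy and value function fall out of a Q-table, assuming a small tabular problem with Q stored as a NumPy array indexed by state and action (the numbers are illustrative):

import numpy as np

# Hypothetical Q-table: rows are states, columns are actions.
Q = np.array([[ 0.0, 72.0,  81.0],
              [66.0, 81.0, 100.0]])

def greedy_policy(Q, s):
    # pi*(s) = argmax_a Q(s, a): the action with the highest Q-value.
    return int(np.argmax(Q[s]))

def value(Q, s):
    # V*(s) = max_a Q(s, a): the value of the best action in state s.
    return float(np.max(Q[s]))

print(greedy_policy(Q, 1), value(Q, 1))   # -> 2 100.0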
Example II
Check that
π*(s) = argmax_a Q(s,a)
V*(s) = max_a Q(s,a)
Q-Learning
• The definition Q(s,a) = r(s,a) + γ V*(δ(s,a)) still depends on r(s,a) and δ(s,a).
• However, imagine the robot is exploring its environment, trying new actions as it
goes.
• At every step it receives some reward r, and it observes the environment change
into a new state s' for action a.
• How can we use these observations (s, a, s', r) to learn a model?
• Q-learning uses the update Q̂(s,a) ← r + γ max_{a'} Q̂(s',a'), where s' = s_{t+1} is
the state reached after taking action a in state s (see the sketch below).
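A minimal sketch of the tabular Q-learning update for a deterministic environment; the table sizes and the example transition are illustrative assumptions, not part of the original slides:

import numpy as np

n_states, n_actions = 6, 4
gamma = 0.9                              # discount factor, as in the worked example
Q = np.zeros((n_states, n_actions))      # Q-hat table, initialised to zero

def q_update(s, a, r, s_next):
    # Deterministic Q-learning update: Q(s,a) <- r + gamma * max_a' Q(s', a')
    Q[s, a] = r + gamma * np.max(Q[s_next])

# Apply the update to one observed transition (s, a, s', r):
q_update(s=0, a=2, r=0.0, s_next=1)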
Q-Learning
• This equation continually updates the estimate of Q at state s so that it is consistent with the estimate
of Q at state s', one step in the future: temporal difference (TD) learning.
• Note that s' is closer to the goal, and hence more “reliable”, but it is still an estimate itself.
• Updating estimates based on other estimates is called bootstrapping.
• We do an update after each state-action pair. I.e., we are learning online!
• We are learning useful things about explored state-action pairs. These are typically most
useful because they are likely to be encountered again.
• Under suitable conditions, these updates can actually be proved to converge to the real
answer.
Example Q-Learning
Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
              = 0 + 0.9 × max{66, 81, 100}
              = 90
Q-learning propagates Q-estimates 1-step backwards
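A one-line check of the arithmetic in this example (the successor Q-values 66, 81, and 100 are the ones shown on the slide):

r, gamma = 0.0, 0.9
successor_q = [66.0, 81.0, 100.0]        # Q-hat(s2, a') for the available actions a'
print(r + gamma * max(successor_q))      # -> 90.0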
Exploration / Exploitation
• It is very important that the agent does not simply follow the current
policy when learning Q (off-policy learning). The reason is that you may
get stuck in a suboptimal solution, i.e., there may be better solutions
out there that you have never seen.
• Hence it is good to try new things now and then, e.g. by choosing actions with
Boltzmann (softmax) probabilities
P(a|s) ∝ e^{Q̂(s,a)/T}
If T is large there is lots of exploring; if T is small the agent follows the current
policy. One can decrease T over time.
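A minimal sketch of Boltzmann (softmax) action selection over a tabular Q; the temperature schedule and table sizes are illustrative assumptions:

import numpy as np

def softmax_action(Q, s, T):
    # P(a|s) proportional to exp(Q(s,a)/T); larger T means more exploration.
    prefs = Q[s] / T
    prefs -= prefs.max()                 # subtract max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

Q = np.zeros((6, 4))
T = 1.0
for step in range(1000):
    a = softmax_action(Q, s=0, T=T)
    T = max(0.05, T * 0.995)             # illustrative decay: explore less over time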
Improvements
• One can trade off memory and computation by caching the observed
transitions (s, a, s', r). After a while, as Q(s', a') has changed,
you can “replay” the update (see the sketch below).
• One can actively search for state-action pairs for which Q(s,a)
is expected to change a lot (prioritized sweeping).
• One can do updates along the sampled path much further back
than just one step (TD(λ) learning).
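A minimal sketch of replaying cached transitions, reusing the tabular q_update from the earlier sketch; the buffer size and batch size are illustrative assumptions:

import random

buffer = []                              # cached transitions (s, a, r, s')

def remember(s, a, r, s_next, max_size=10000):
    buffer.append((s, a, r, s_next))
    if len(buffer) > max_size:
        buffer.pop(0)

def replay(q_update, batch_size=32):
    # Re-apply the Q-learning update to stored transitions; because Q(s', a')
    # has changed since they were recorded, replaying propagates the new values.
    for s, a, r, s_next in random.sample(buffer, min(batch_size, len(buffer))):
        q_update(s, a, r, s_next)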
Extensions
• To deal with stochastic environments, we need to maximize the
expected future discounted reward: E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... ].
• Often the state space is too large to deal with all states. In this case we
need to learn a function approximation: Q(s,a) ≈ f_θ(s,a).
• Neural networks with back-propagation have been quite successful here.
• For instance, TD-Gammon is a backgammon program that plays at expert level: its state space is very
large, it was trained by playing against itself, it uses a neural network to approximate the value function,
and it uses TD(λ) for learning.
More on Function Approximation
• For instance, a linear function: Q_θ(s,a) = Σ_k θ_k φ_k(s,a).
• The features φ_k are fixed measurements of the state (e.g., the number of stones
on the board).
• We only learn the parameters θ.
• Update rule (start in state s, take action a, observe reward r, and end
up in state s'):
θ_k ← θ_k + α [ r + γ max_{a'} Q_θ(s',a') − Q_θ(s,a) ] φ_k(s,a)
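A minimal sketch of this linear-approximation update, assuming a hypothetical feature function phi(s, a) returning a fixed-length NumPy vector and an illustrative learning rate:

import numpy as np

n_features = 8
theta = np.zeros(n_features)             # learned parameters
alpha, gamma = 0.1, 0.9                  # illustrative learning rate and discount

def phi(s, a):
    # Hypothetical fixed feature map; replace with real state-action features.
    rng = np.random.default_rng(abs(hash((s, a))) % (2**32))
    return rng.standard_normal(n_features)

def q_value(s, a):
    return theta @ phi(s, a)

def linear_q_update(s, a, r, s_next, actions):
    global theta
    # theta_k += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)] * phi_k(s,a)
    td_error = r + gamma * max(q_value(s_next, a2) for a2 in actions) - q_value(s, a)
    theta = theta + alpha * td_error * phi(s, a)

linear_q_update(s=0, a=1, r=0.0, s_next=1, actions=range(4))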
Conclusion
• Reinforcement learning addresses a very broad and relevant question:
• How can we learn to survive in our environment?
• We have looked at Q-learning, which simply learns from experience.
• No model of the world is needed.
• We made simplifying assumptions: e.g., state of the world only
depends on last state and action. This is the Markov assumption. The
model is called a Markov Decision Process (MDP).
• We assumed deterministic dynamics and a deterministic reward function, but the real world
is stochastic.
• There are many extensions to speed up learning.
• There have been many successful real-world applications.
Applications of Reinforcement Learning
• Robotics for industrial automation.
• Business strategy planning
• Machine learning and data processing
• Personalized training systems that adapt instruction and materials
to the needs of individual students
• Aircraft control and robot motion control
• Traffic Light Control
• A robot cleaning a room and recharging its battery
• Robot-soccer
• How to invest in shares
• Modeling the economy through rational agents
• Learning how to fly a helicopter
• Scheduling planes to their destinations
THANK YOU
TEAM ML