Day 2: MDPs & Value Functions
• Understanding the building blocks of
Reinforcement Learning
• Your Name
• Date
Agenda
• 1. Markov Decision Processes (MDPs)
• 2. Policies
• 3. Value Functions
• 4. Activity: Compute V(s)
• 5. Project: Set up Hand Simulation
What is an MDP?
• Markov Decision Process = (S, A, P, R, γ)
• S: Set of states
• A: Set of actions
• P(s' | s, a): Transition probabilities
• R(s, a): Reward function
• γ: Discount factor
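A minimal sketch of how this tuple might look in code, using NumPy arrays (all sizes and numbers here are hypothetical, for illustration only):

import numpy as np

# S = {0, 1, 2} and A = {0, 1}, encoded as array indices (hypothetical sizes)
n_states, n_actions = 3, 2

# P[a, s, s2] = P(s' | s, a); each row is a probability distribution
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.7, 0.3, 0.0],
        [0.0, 0.4, 0.6],
        [0.0, 0.0, 1.0]]
P[1] = [[0.0, 1.0, 0.0],
        [0.5, 0.0, 0.5],
        [0.0, 0.0, 1.0]]

# R[s, a] = R(s, a), the expected immediate reward
R = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])

gamma = 0.9  # discount factor, 0 <= gamma < 1

assert np.allclose(P.sum(axis=2), 1.0)  # every P(. | s, a) sums to 1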
Markov Property
• The future is independent of the past given
the present
• Formally:
P(s_{t+1} | s_t, a_t, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t)
Policy (π)
• Policy: Mapping from states to actions
• Deterministic: π(s) = a
• Stochastic: π(a | s) = probability of taking action a in state s
• Goal of RL: Find the optimal policy π* that maximizes the expected discounted return
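Both policy types have a direct array encoding; a minimal sketch, reusing the hypothetical 3-state, 2-action MDP from the earlier slide:

import numpy as np

# Deterministic policy: pi(s) = a, one action index per state
pi_det = np.array([0, 1, 0])

# Stochastic policy: pi_stoch[s, a] = pi(a | s), one distribution per state
pi_stoch = np.array([[0.9, 0.1],
                     [0.5, 0.5],
                     [0.2, 0.8]])
assert np.allclose(pi_stoch.sum(axis=1), 1.0)  # rows sum to 1

# Acting with each policy in state s = 1
rng = np.random.default_rng(seed=0)
a_det = pi_det[1]                       # always the same action
a_stoch = rng.choice(2, p=pi_stoch[1])  # sampled from pi(. | s = 1)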
Value Functions
• State-Value Function V^π(s):
• Expected return from state s following π:
• V^π(s) = E_π [∑_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s]
• Action-Value Function Q^π(s, a):
• Expected return from s, taking action a, then following π:
• Q^π(s, a) = E_π [∑_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, a_0 = a]
Bellman Equations
• Value Function (Recursive Form):
• V^π(s) = ∑_a π(a | s) ∑_{s'} P(s' | s, a) [R(s, a) + γ V^π(s')]
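One Bellman backup for a single state, written out as code; every number below is hypothetical (two actions, two successor states):

gamma = 0.9
pi = {"A1": 0.5, "A2": 0.5}                 # pi(a | s)
P = {"A1": {"S1": 0.8, "S2": 0.2},          # P(s' | s, a)
     "A2": {"S1": 0.1, "S2": 0.9}}
R = {"A1": 1.0, "A2": 0.0}                  # R(s, a)
V = {"S1": 2.0, "S2": 0.5}                  # current estimate of V^pi(s')

# V^pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [R(s,a) + gamma * V^pi(s')]
v_s = sum(pi[a] * sum(P[a][sp] * (R[a] + gamma * V[sp]) for sp in P[a])
          for a in pi)
print(f"V(s) = {v_s:.4f}")  # 1.5575 with these numbers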
Visual: Simple MDP Example
• (Include a simple MDP diagram or describe
the structure verbally)
Activity: Compute V(s)
• Given:
• - 3 States: S1, S2, S3
• - Actions: A1, A2
• - Rewards and transitions shown in
diagram/table
• Task: Compute V(s) for each state under a
fixed policy
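For a fixed policy the Bellman equation is a linear system, so the activity answers can be checked with a direct solve; a sketch assuming the hypothetical P, R, gamma, and pi_stoch arrays from the earlier slides:

import numpy as np

def policy_evaluation_exact(P, R, gamma, pi):
    """Solve (I - gamma * P_pi) V = R_pi for V^pi.

    P: (A, S, S) transitions, R: (S, A) rewards, pi: (S, A) stochastic policy.
    """
    P_pi = np.einsum("sa,ast->st", pi, P)  # P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a)
    R_pi = (pi * R).sum(axis=1)            # R_pi[s] = sum_a pi(a|s) R(s, a)
    n = R_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)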
Python Activity Preview
• Implement a simple MDP:
• - Define states/actions
• - Define transition matrix and reward table
• - Evaluate a policy using Bellman equation
• Tools: Python, NumPy
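One way the preview could look, iterating the Bellman equation until the values stop changing (a sketch, assuming the same hypothetical array shapes as above):

import numpy as np

def policy_evaluation_iterative(P, R, gamma, pi, tol=1e-8):
    """Iterative policy evaluation: repeat V <- Bellman backup of V.

    P: (A, S, S) transitions, R: (S, A) rewards, pi: (S, A) stochastic policy.
    """
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = (pi * Q).sum(axis=1)  # average over pi(a | s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new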
Project: Hand Simulation Setup
• Goal: Simulate a hand (e.g., a robotic hand, or a hand in a card game)
• Today:
• - Define state space (e.g., hand positions, grip
strength)
• - Define actions (e.g., open, close, flex)
• - Define reward (e.g., holding object = +1, drop
= -1)
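One possible encoding of these three pieces, to seed the group work; every state, action, and reward value below is a hypothetical example to be replaced with your team's scenario:

STATES = ["open_empty", "closed_empty", "holding_object", "dropped_object"]
ACTIONS = ["open", "close", "flex"]

def reward(state, action, next_state):
    # Assumed reward scheme from the slide: holding = +1, dropping = -1
    if next_state == "holding_object":
        return +1.0
    if next_state == "dropped_object":
        return -1.0
    return 0.0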
Group Work Prompt
• In teams, sketch out a hand-simulation scenario
• Identify:
• - State space
• - Action space
• - Rewards
• - Transition rules
• Present briefly at the end
What’s Next (Day 3 Preview)
• Solving MDPs:
• - Policy Evaluation
• - Policy Iteration
• - Value Iteration
• Implementing solutions in code
Q&A / Wrap-Up
• Questions?
• Recap today’s key points
• Check-in: Do students feel confident in MDPs
and value functions?