RL Unit1 Notes
IV-I SEM
CSE (AI&ML)
UNIT-I-NOTES
Exploration: Exploration refers to the process of trying out different actions or choices
that an agent hasn't yet experienced in order to gather information about their potential
outcomes. It involves taking actions that might not be immediately optimal or known to
the agent, with the aim of discovering new knowledge about the environment and
improving the agent's understanding of the rewards associated with different actions.
Exploration is crucial for learning the underlying dynamics of the environment and
discovering optimal strategies over time.
Exploitation: Exploitation involves making decisions that are currently believed to be the
best based on the agent's current knowledge and experience. It refers to the act of
selecting actions that are expected to yield the highest immediate reward, based on the
agent's existing understanding of the environment. Exploitation is important for
maximizing short-term gains and making optimal decisions based on what is currently
known.
In practice, this might look like a gambler trying out different slot machines to see which
one pays out the most (exploration), while also sticking with a machine that has paid out
well in the past (exploitation).
In a Contextual Bandit setting, each decision point is associated with a context, which
represents relevant information or features that describe the current state or context of the
environment. The goal of the agent is to learn a policy that maps these context features to
the best action to take, considering the expected rewards associated with each action in
the given context. The agent's objective is to strike a balance between exploring different
actions to learn their associated rewards and exploiting its current knowledge to
maximize cumulative rewards.
7. Define Markov Property?
Answer:
A process has the Markov property if the current state alone contains enough information to
choose optimal actions to maximize future rewards; the future is independent of the past given
the present. A decision process with this property is called a Markov decision process (MDP),
and modeling a control task as an MDP is a key concept in reinforcement learning.
LONG-ANSWER QUESTIONS
1. Write about Dynamic programming versus Monte Carlo method?
Answer:
Dynamic Programming (DP) and Monte Carlo (MC) are two fundamental approaches
used in reinforcement learning to solve Markov decision processes (MDPs) and optimize
decision-making strategies. Both methods aim to find optimal policies that maximize the
cumulative reward over time. However, they have different characteristics and are suited
for different types of problems. Let's compare Dynamic Programming and Monte Carlo
methods in reinforcement learning:
Dynamic Programming (DP) methods:
Principle: DP methods are model-based approaches that compute value functions and optimal
policies by iteratively applying the Bellman equations over the full state space.
Assumption: DP requires complete knowledge of the environment's transition probabilities and
reward function.
Policy Evaluation: DP evaluates a policy by sweeping through all states and bootstrapping from
the current value estimates.
Exploration: DP does not need to explore, because it plans from the model rather than from
sampled experience.
Variance and Bias: DP estimates have low variance, though bootstrapping introduces bias until
the values converge.
Sample Efficiency: DP uses no samples, but its computational cost grows rapidly with the size
of the state space.
Batch and Online Learning: DP is typically applied offline, as a planning method, before any
interaction with the environment.
Monte Carlo (MC) methods:
Principle: MC methods are model-free approaches that estimate the value function or the
policy by simulating episodes of interaction with the environment and averaging the
observed returns.
Assumption: MC methods do not require prior knowledge of the transition probabilities
and reward functions. They learn directly from interaction with the environment.
Policy Evaluation: MC methods estimate the value function by averaging the returns
obtained from different trajectories under a given policy.
Exploration: MC methods inherently explore by interacting with the environment and
collecting data.
Variance and Bias: MC methods can have higher variance due to the randomness of the
observed returns, but they are unbiased estimators of the value function.
Sample Efficiency: MC methods may require a larger number of episodes to converge to
the optimal solution compared to DP, especially in complex environments.
Batch and Online Learning: MC methods can be naturally applied in online learning
settings, where data is collected in an ongoing manner.
In summary, Dynamic Programming is more suitable for small and well-defined MDPs
where complete knowledge of the environment is available, while Monte Carlo methods
are effective when the MDP is unknown or too complex to model accurately. MC
methods excel in online learning scenarios where interaction with the environment is
feasible. Both DP and MC have their strengths and weaknesses, and the choice between
them depends on the characteristics of the problem and the available information about
the environment.
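To make the idea of averaging returns concrete, here is a minimal first-visit Monte Carlo
evaluation sketch in Python. The episode format and the sample_episode function are
assumptions for illustration, not part of any particular library.

```python
from collections import defaultdict

def mc_evaluate(sample_episode, num_episodes=1000, gamma=0.99):
    """First-visit Monte Carlo policy evaluation (a sketch).

    sample_episode() is assumed to return a list of (state, reward) pairs
    produced by following the policy being evaluated.
    """
    returns_sum = defaultdict(float)   # total first-visit return per state
    returns_count = defaultdict(int)   # number of episodes visiting the state

    for _ in range(num_episodes):
        episode = sample_episode()
        G = 0.0
        first_return = {}
        # Walk backwards so G accumulates the discounted return from each step;
        # earlier visits overwrite later ones, leaving the first-visit return.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_return[state] = G
        for state, G_first in first_return.items():
            returns_sum[state] += G_first
            returns_count[state] += 1

    # The value estimate is the average observed return, as described above.
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```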
The input data is generated by the environment. In general, the environment of an RL (or
control) task is any dynamic process that produces data that is relevant to achieving our
objective. Although we use “environment” as a technical term, it’s not too far abstracted
from its everyday usage. As an instance of a very advanced RL algorithm yourself, you
are always in some environment, and your eyes and ears are constantly consuming
information produced by your environment so you can achieve your daily objectives.
Since the environment is a dynamic process (a function of time), it may be producing a
continuous stream of data of varied size and type. To make things algorithm-friendly, we
need to take this environment data and bundle it into discrete packets that we call the
state (of the environment) and then deliver it to our algorithm at each of its discrete time
steps. The state reflects our knowledge of the environment at some particular time, just as
a digital camera captures a discrete snapshot of a scene at some time (and produces a
consistently formatted image).
To summarize so far, we defined an objective function (minimize costs by optimizing
temperature) that is a function of the state (current costs, current temperature data) of the
environment (the data center and any related processes). The last part of our model is the
RL algorithm itself. This could be any parametric algorithm that can learn from data to
minimize or maximize some objective function by modifying its parameters. It does not
need to be a deep learning algorithm; RL is a field of its own, separate from the concerns
of any particular learning algorithm.
As we noted before, one of the key differences between RL (or control tasks generally)
and ordinary supervised learning is that in a control task the algorithm needs to make
decisions and take actions. These actions will have a causal effect on what happens in the
future. Taking an action is a keyword in the framework, and it means more or less what
you’d expect it to mean. However, every action taken is the result of analyzing the
current state of the environment and attempting to make the best decision based on that
information.
The last concept in the RL framework is that after each action is taken, the algorithm is
given a reward. The reward is a (local) signal of how well the learning algorithm is
performing at achieving the global objective. The reward can be a positive signal (i.e.,
doing well, keep it up) or a negative signal (i.e., don’t do that) even though we call both
situations a “reward.”
The reward signal is the only cue the learning algorithm has to go by as it updates itself in
hopes of performing better in the next state of the environment. In our data center
example, we might grant the algorithm a reward of +10 (an arbitrary value) whenever its
action reduces the error value. Or more reasonably, we might grant a reward proportional
to how much it decreases the error. If it increases the error, we would give it a negative
reward.
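A tiny sketch of that reward scheme; the variable names and the scale factor are illustrative
assumptions, not part of the original example:

```python
def compute_reward(prev_error, curr_error, scale=10.0):
    """Reward proportional to how much the last action reduced the error.

    Positive when the error decreased, negative when it increased.
    """
    return scale * (prev_error - curr_error)
```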
Lastly, let’s give our learning algorithm a fancier name, calling it the agent. The agent is
the action-taking or decision-making learning algorithm in any RL problem. We can put
this all together as shown in figure.
Figure: The standard framework for RL algorithms. The agent takes an action in the
environment, such as moving a chess piece, which then updates the state of the
environment. For every action it takes, it receives a reward (e.g., +1 for winning the
game, –1 for losing the game, 0 otherwise). The RL algorithm repeats this process with
the objective of maximizing rewards in the long term, and it eventually learns how the
environment works.
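The loop shown in the figure can be written out as a short sketch. The env and agent objects
and their method names here are assumptions for illustration (loosely following the common
reset/step pattern), not a specific library's API.

```python
def run_episode(env, agent, max_steps=1000):
    """One pass through the standard RL loop: observe state, act, receive reward."""
    state = env.reset()                               # initial state of the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                     # agent picks an action from the state
        next_state, reward, done = env.step(action)   # environment responds with new state and reward
        agent.update(state, action, reward, next_state)  # learn from the reward signal
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```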
RL research and applications are still maturing, but there have been many exciting
developments in recent years. Google’s DeepMind research group has showcased some
impressive results and garnered international attention. The first was in 2013 with an
algorithm that could play a spectrum of Atari games at superhuman levels. Previous
attempts at creating agents to solve these games involved fine-tuning the underlying
algorithms to understand the specific rules of the game, often called feature engineering.
These feature engineering approaches can work well for a particular game, but they are
unable to transfer any knowledge or skills to a new game or domain. DeepMind’s deep
Q-network (DQN) algorithm was robust enough to work on seven games without any
game-specific tweaks. It had nothing more than the raw pixels from the screen as input
and was merely told to maximize the score, yet the algorithm learned how to play beyond
an expert human level.
More recently, DeepMind’s AlphaGo and AlphaZero algorithms beat the world’s best
players at the ancient Chinese game Go. Experts believed artificial intelligence would not
be able to play Go competitively for at least another decade because the game has
characteristics that algorithms typically don’t handle well. Players do not know the best
move to make at any given turn and only receive feedback for their actions at the end of
the game. Many high-level players saw themselves as artists rather than calculating
strategists and described winning moves as being beautiful or elegant. With over 10^170
legal board positions, brute force algorithms (which IBM’s Deep Blue used to win at
chess) were not feasible. AlphaGo managed this feat largely by playing simulated games
of Go millions of times and learning which actions maximized the rewards of playing the
game well. Similar to the Atari case, AlphaGo only had access to the same information a
human player would: where the pieces were on the board.
While algorithms that can play games better than humans are remarkable, the promise
and potential of RL goes far beyond making better game bots. DeepMind was able to
create a model to decrease Google’s data center cooling costs by 40%, something we
explored earlier in this chapter as an example. Autonomous vehicles use RL to learn
which series of actions (accelerating, turning, braking, signaling) leads to passengers
reaching their destinations on time and to learn how to avoid accidents. And researchers
are training robots to complete tasks, such as learning to run, without explicitly
programming complex motor skills.
Many of these examples are high stakes, like driving a car. You cannot just let a learning
machine learn how to drive a car by trial and error. Fortunately, there are an increasing
number of successful examples of letting learning machines loose in harmless simulators,
and once they have mastered the simulator, letting them try real hardware in the real
world. One instance that we will explore in this book is algorithmic trading. A substantial
fraction of all stock trading is executed by computers with little to no input from human
operators. Most of these algorithmic traders are wielded by huge hedge funds managing
billions of dollars. In the last few years, however, we’ve seen more and more interest by
individual traders in building trading algorithms. Indeed, Quantopian provides a platform
where individual users can write trading algorithms in Python and test them in a safe,
simulated environment. If the algorithms perform well, they can be used to trade real
money. Many traders have achieved relative success with simple heuristics and rule-
based algorithms. However, equity markets are dynamic and unpredictable, so a
continuously learning RL algorithm has the advantage of being able to adapt to changing
market conditions in real time.
One practical problem we’ll tackle early in this book is advertisement placement. Many
web businesses derive significant revenue from advertisements, and the revenue from ads
is often tied to the number of clicks those ads can garner. There is a big incentive to place
advertisements where they can maximize clicks. The only way to do this, however, is to
use knowledge about the users to display the most appropriate ads. We generally don’t
know what characteristics of the user are related to the right ad choices, but we can
employ RL techniques to make some headway. If we give an RL algorithm some
potentially useful information about the user (what we would call the environment, or
state of the environment) and tell it to maximize ad clicks, it will learn how to associate
its input data to its objective, and it will eventually learn which ads will produce the most
clicks from a particular user.
1. Actions (Arms): The set of choices or actions available to the agent. Each action
corresponds to an "arm" of the bandit.
2. Rewards: Each arm, when selected, yields a reward drawn from an unknown probability
distribution; the agent only observes the reward of the arm it actually plays.
3. Exploration: The agent needs to balance exploration (trying out different actions to
gather information) and exploitation (choosing actions with the highest estimated
rewards based on available information).
4. Cumulative Reward: The agent aims to maximize its cumulative reward over a series
of interactions by making informed decisions about which action to select at each
step.
There are several strategies and algorithms used to address the multi-arm bandit problem,
each with its own approach to balancing exploration and exploitation. Here are some
common solutions for the multi-arm bandit problem:
Epsilon-Greedy Algorithm:
This is a simple and widely used approach. The agent selects the arm with the highest
estimated reward with probability (1-ε), and explores a random arm with probability ε.
Over time, the algorithm tends to converge to the optimal arm while still exploring other
arms.
EXP3 Algorithm:
The Exponential-weight algorithm for Exploration and Exploitation (EXP3) is designed for
adversarial bandit settings. It maintains a weight for each arm, selects arms with probabilities
proportional to these weights (mixed with a small amount of uniform exploration), and updates
the weights multiplicatively based on the rewards it observes.
Softmax solution:
The softmax solution is a popular approach to solving the multi-arm bandit problem that
involves using the softmax function to model the agent's exploration and exploitation
behavior. This approach is also known as the "Boltzmann exploration" strategy or the
"soft-max" algorithm. It provides a probabilistic way to select arms based on their
estimated rewards.
Contextual Bandits:
In this extension of the problem, additional contextual information about the arms is
provided. The agent learns a mapping from context to actions, allowing for more
personalized and efficient decision-making.
In practice, this might look like a gambler who usually plays the slot machine that has
paid out the most in the past (the "greedy" part), but occasionally tries out a different
machine just to see if it might pay out even more (the "epsilon" part).
1. Choose a value for epsilon (the "exploration rate"). This value determines how often
the agent will choose a random action instead of the best known action. A common value
for epsilon is 0.1, which means that the agent will choose a random action 10% of the
time.
2. At each time step, generate a random number between 0 and 1. If the random number
is less than epsilon, choose a random action. Otherwise, choose the action with the
highest estimated value (the "greedy" part).
3. Update the estimated value of the chosen action based on the reward received.
4. Repeat steps 2-3 for a fixed number of time steps or until convergence.
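A minimal sketch of these four steps in Python with NumPy. The reward function pull_arm and
the incremental-average update are illustrative assumptions:

```python
import numpy as np

def epsilon_greedy(pull_arm, n_arms, n_steps=1000, epsilon=0.1):
    """Epsilon-greedy bandit: explore with probability epsilon, otherwise exploit.

    pull_arm(a) is assumed to return the (random) reward of playing arm a.
    """
    values = np.zeros(n_arms)   # estimated value of each arm
    counts = np.zeros(n_arms)   # number of times each arm was played

    for _ in range(n_steps):
        if np.random.rand() < epsilon:
            action = np.random.randint(n_arms)   # explore: pick a random arm
        else:
            action = int(np.argmax(values))      # exploit: pick the best-known arm
        reward = pull_arm(action)
        counts[action] += 1
        # Incremental running average of the rewards observed for this arm
        values[action] += (reward - values[action]) / counts[action]

    return values
```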
To solve this problem using the epsilon-greedy strategy, the agent would start by
randomly choosing one of the slot machines. After each play, the agent would update the
estimated value of the chosen machine based on the reward received. Then, at each
subsequent play, the agent would choose the machine with the highest estimated value
with probability 1-epsilon, and a random machine with probability epsilon. Over time,
the agent would converge on the machine with the highest payout probability, while still
occasionally trying out other machines to see if they might be better.
Here is a more detailed example of how the epsilon-greedy strategy can be used to solve
a multi-armed bandit problem:
Suppose we have a gambler who is trying to maximize their winnings by playing one of
several slot machines. Each machine has an unknown payout probability, and the gambler
must choose which machine to play at each time step. The goal is to maximize the total
payout over a fixed number of time steps.
To solve this problem using the epsilon-greedy strategy, the gambler would start by
randomly choosing one of the slot machines. After each play, the gambler would
update the estimated value of the chosen machine based on the reward received. For
example, if the gambler played machine A and won 10 coins, they might update the
estimated value of machine A to be 10.
Then, at each subsequent play, the gambler would choose the machine with the highest
estimated value with probability 1-epsilon, and a random machine with probability
epsilon. For example, if the gambler had estimated values of 10 for machine A, 5 for
machine B, and 3 for machine C, and epsilon was set to 0.1 (with the random choice spread
over the other two machines), the gambler would choose machine A with probability 0.9,
machine B with probability 0.05, and machine C with probability 0.05.
Over time, the gambler would converge on the machine with the highest payout
probability, while still occasionally trying out other machines to see if they might be
better. For example, if machine A had a higher payout probability than the other
machines, the gambler would eventually start playing machine A more often than the
other machines, but would still occasionally try out the other machines to see if their
estimated values had changed.
The epsilon-greedy strategy is a simple but effective way to balance exploration and
exploitation in the multi-armed bandit problem. By occasionally trying out new machines,
the gambler can discover better options that might not have been found otherwise. At the
same time, by mostly sticking with the machines that have already paid out well, the gambler
can avoid wasting time on machines that are unlikely to pay off.
In practice, this might look like a gambler who assigns probabilities to each slot machine
based on how much they have paid out in the past. The machine that has paid out the
most might be assigned a probability of 0.8, while the machine that has paid out the least
might be assigned a probability of 0.1. The gambler would then choose a machine at
random based on these probabilities.
The soft-max selection policy is often used in reinforcement learning because it provides
a more nuanced approach to balancing exploration and exploitation. By assigning
probabilities to each action, the agent can explore new options while still favoring actions
that are known to be good. Additionally, the soft-max selection policy can provide
information about the relative values of different actions, which can be useful in
situations where the agent needs to make more complex decisions.
To implement the soft-max selection policy, we first need to estimate the value of each
possible action. This can be done using a variety of methods, such as Monte Carlo
simulation or temporal difference learning. Once we have estimated the value of each
action, we can use the soft-max function to assign probabilities to each action. The soft-
max function is defined as:
Pr(a) = exp(Q(a)/τ) / Σ_b exp(Q(b)/τ)
where Q(a) is the estimated value of action a, the sum in the denominator runs over all
available actions b, and τ is a temperature parameter: a high temperature makes the
probabilities nearly uniform (more exploration), while a low temperature concentrates the
probability on the highest-valued action (more exploitation).
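A minimal sketch of this selection rule in Python with NumPy, using a numerically stable form
of the softmax (the temperature value is just an illustrative default):

```python
import numpy as np

def softmax_probs(values, tau=1.0):
    """Convert estimated action values into selection probabilities."""
    prefs = np.asarray(values, dtype=float) / tau
    prefs -= prefs.max()                  # subtract the max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def softmax_select(values, tau=1.0):
    """Sample an action index according to its softmax probability."""
    probs = softmax_probs(values, tau)
    return int(np.random.choice(len(probs), p=probs))

# With estimated values 0.8 and 0.6 and tau = 1, the probabilities come out to
# roughly 0.55 and 0.45, as used in the examples below.
print(softmax_probs([0.8, 0.6]))   # approximately [0.55, 0.45]
```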
Consider, for example, a doctor choosing between two treatments, A and B, for patients with
the same condition. Using the soft-max selection policy, the doctor would start by
estimating the value of each treatment based on previous outcomes. For example, if
treatment A had saved 8 out of 10 patients, while treatment B had saved 6 out of 10
patients, the doctor might estimate the value of treatment A to be higher than the value of
treatment B.
Then, at each subsequent patient encounter, the doctor would use the soft-max function to
assign probabilities to each treatment based on their estimated values. For example, if
treatment A had an estimated value of 0.8 and treatment B had an estimated value of 0.6,
the doctor would assign a probability of about 0.55 to treatment A and about 0.45 to
treatment B (assuming a temperature parameter of 1).
Over time, the doctor would converge on the treatment with the highest estimated value,
while still occasionally trying out other treatments to see if they might be better. By using
the soft-max selection policy, the doctor can explore different treatment options while
still favoring treatments that are known to be effective. Additionally, the soft-max
function provides information about the relative values of different treatments, which can
be useful in situations where the doctor needs to make more complex decisions.
To solve this problem using the soft-max selection policy, the gambler would start by
estimating the value of each machine based on previous outcomes. For example, if
machine A had paid out 8 coins on average, while machine B had paid out 6 coins on
average, the gambler might estimate the value of machine A to be higher than the value
of machine B.
Then, at each subsequent play, the gambler would use the soft-max function to assign
probabilities to each machine based on their estimated values. For example, if machine A
had an estimated value of 0.8 and machine B had an estimated value of 0.6, the gambler
would assign a probability of about 0.55 to machine A and about 0.45 to machine B
(assuming a temperature parameter of 1).
Over time, the gambler would converge on the machine with the highest payout probability,
while still occasionally trying out other machines to see if they might be better. By using the
softmax selection policy, the gambler can explore different machines while still favoring
machines that are known to pay out well. Additionally, the soft-max function provides
information about the relative values of different machines, which can be useful in situations
where the gambler needs to make more complex decisions.
The basic idea is to treat the ad placement problem as a multi-armed bandit problem,
where each "arm" corresponds to a different ad placement strategy. The algorithm then
tries different strategies and learns which ones are most effective over time.
One popular approach is to use softmax action selection, which computes probabilities
for each arm based on their current action values. The algorithm then chooses an arm
randomly but weighted by these probabilities. This approach tends to converge faster on
the maximal mean reward compared to the epsilon-greedy method.
Of course, in practice, there are many additional complexities to consider, such as the fact
that the best ad placement may depend on the context (e.g. which website the user is on).
This is where contextual bandits come in, which add a new layer of complexity to the
problem.
There are many companies that use bandit algorithms for ad placement optimization. One
example is Google, which applies bandit-style optimization within its advertising platform
(formerly AdWords, now Google Ads) to decide which ads to show on its search engine results
pages.
In addition, many other companies in the advertising industry use bandit algorithms to
optimize ad placement, including Facebook, Amazon, and many others.
Overall, bandit algorithms have become an important tool for companies looking to
maximize the effectiveness of their ad placements and improve their return on
investment.
PyTorch and NumPy are both popular libraries for numerical computing in Python, but
they have some key differences. Here are a few:
1. Tensors vs arrays: In PyTorch, the basic data structure is the tensor, which is similar to
a NumPy array but with some additional features. Tensors can be used to represent multi-
dimensional arrays of numerical data, and they can be easily moved between CPU and
GPU memory for efficient computation.
2. Automatic differentiation: PyTorch tensors can track the operations performed on them and
automatically compute gradients (autograd), which is essential for training neural networks.
NumPy has no built-in support for automatic differentiation.
3. GPU acceleration: PyTorch includes built-in support for GPU acceleration, which can
greatly speed up numerical computations. NumPy can also be used with GPUs, but it
requires additional libraries (such as CuPy) to do so.
Overall, PyTorch and NumPy are both powerful libraries for numerical computing, but
they have different strengths and weaknesses depending on the task at hand. PyTorch is
particularly well-suited for machine learning tasks that require automatic differentiation
and GPU acceleration, while NumPy is more general-purpose and can be used for a wide
range of numerical computations.
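A short sketch illustrating these differences; the shapes and values are arbitrary:

```python
import numpy as np
import torch

# A NumPy array and a PyTorch tensor holding the same data
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
t = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)

# Autograd: PyTorch tracks operations and can compute gradients automatically
loss = (t ** 2).sum()
loss.backward()
print(t.grad)          # gradient of loss with respect to t, i.e. 2 * t

# GPU acceleration: move the tensor to a GPU if one is available
if torch.cuda.is_available():
    t_gpu = t.detach().to("cuda")
```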
PyTorch is a popular library for building deep learning models in Python. Here are some
of the key steps involved in building models with PyTorch:
1. Define the model architecture: The first step in building a PyTorch model is to define
its architecture. This involves specifying the number and type of layers in the model, as
well as any activation functions or other operations that should be applied to the data.
2. Define the loss function: The next step is to define the loss function, which is used to
measure how well the model is performing. This will depend on the specific task you are
trying to solve (e.g. classification, regression, etc.).
3. Define the optimizer: The optimizer is used to update the model parameters during
training in order to minimize the loss function. PyTorch includes a variety of built-in
optimizers, such as stochastic gradient descent (SGD), Adam, and RMSprop.
4. Train the model: Once the model architecture, loss function, and optimizer have been
defined, you can begin training the model. This involves feeding training data into the
model, computing the loss, and using the optimizer to update the model parameters.
5. Evaluate the model: After the model has been trained, you can evaluate its
performance on a separate validation or test dataset. This will give you an idea of how
well the model is likely to perform on new, unseen data.
6. Fine-tune the model: Depending on the results of the evaluation, you may need to fine-
tune the model by adjusting its architecture, hyperparameters, or other settings. This
process may involve repeating steps 1-5 multiple times until you achieve satisfactory
results.
Overall, building models with PyTorch involves a combination of defining the model
architecture, loss function, and optimizer, as well as training and evaluating the model to
achieve the best possible performance on your specific task.
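A minimal sketch of these steps for a small classifier; the layer sizes, dummy data, and
hyperparameters are arbitrary placeholders rather than recommendations:

```python
import torch
import torch.nn as nn

# 1. Define the model architecture (sizes are illustrative)
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 3),   # e.g. a 3-class classification problem
)

# 2. Define the loss function and 3. the optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy data standing in for a real training set
X = torch.randn(100, 10)
y = torch.randint(0, 3, (100,))

# 4. Train the model
for epoch in range(20):
    optimizer.zero_grad()
    logits = model(X)
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()

# 5. Evaluate the model (here just training accuracy, for brevity)
with torch.no_grad():
    accuracy = (model(X).argmax(dim=1) == y).float().mean()
    print(f"accuracy: {accuracy:.2f}")
```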
Solving contextual bandits involves finding the optimal policy for selecting actions in a
multi-armed bandit problem where the rewards for each action may depend on some
context or state information.
In the case of contextual bandits for advertisement placement, the goal is to find the best
ad to place for a given user context (such as the website they are currently on). This
involves selecting an action (i.e. an ad) that maximizes the expected reward, taking into
account the context information.
One approach to solving contextual bandits is to use a model-based approach, where the
agent learns a model of the reward function based on the context information and uses
this model to select actions. Another approach is to use a model-free approach, where the
agent learns the optimal policy directly from experience, without explicitly modeling the
reward function.
In the case of model-free approaches, one popular choice is to use a neural network to
estimate the expected reward for each action given the context information. The network is
trained with a variant of stochastic gradient descent, using a loss that compares its
prediction for the chosen action against the reward actually observed.
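A rough PyTorch sketch of this idea: a small network maps a one-hot context to one reward
estimate per action and is updated on the reward of the chosen action. The sizes, the MSE
loss choice, and the get_reward helper are hypothetical placeholders:

```python
import torch
import torch.nn as nn

n_contexts, n_actions = 10, 5

# Network maps a one-hot context vector to one reward estimate per action
net = nn.Sequential(nn.Linear(n_contexts, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

def train_step(context_id, get_reward, epsilon=0.1):
    """One interaction: observe context, pick an action, learn from the reward."""
    context = torch.zeros(n_contexts)
    context[context_id] = 1.0                            # one-hot encode the context
    pred_rewards = net(context)                          # estimated reward per action
    if torch.rand(1).item() < epsilon:
        action = torch.randint(n_actions, (1,)).item()   # explore: random action
    else:
        action = int(pred_rewards.argmax())              # exploit: best predicted action
    reward = get_reward(context_id, action)              # observed reward (e.g. click = 1)
    # Update only the prediction for the action actually taken
    loss = loss_fn(pred_rewards[action], torch.tensor(float(reward)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action, reward
```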
Overall, solving contextual bandits involves finding the best strategy for selecting actions
in a multi-armed bandit problem where the rewards depend on some context or state
information. This can be done using a variety of model-based or model-free approaches,
depending on the specific problem and available data.
The downside to using softmax action selection for ad placement optimization is that it can
become overly confident in its current estimates, leading to suboptimal results. Softmax
assigns a probability to each action based on its estimated reward and then samples actions
in proportion to those probabilities. If the temperature is set too low, or the reward
estimates are still uncertain or noisy, most of the probability mass ends up on an action that
only appears to be the best, so genuinely better actions are rarely tried and the algorithm
settles on suboptimal placements.
The bandit algorithm works in the context of ad placement optimization by selecting the
best ad to display to a user based on their context (such as the website they are currently
on). The algorithm learns from past user interactions to improve its predictions over time.
Specifically, the algorithm maintains a probability distribution over the possible ads to
display, and selects an ad to display based on this distribution. After the ad is displayed,
the algorithm updates its probability distribution based on the observed reward (i.e.
whether the user clicked on the ad or not).
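The description above leaves open how such a probability distribution can be maintained and
updated. One simple, illustrative way is Beta-Bernoulli Thompson sampling over click/no-click
outcomes; this is a sketch, not a claim about what any particular company runs:

```python
import numpy as np

class ThompsonAdSelector:
    """Keeps a Beta distribution over the click-through rate of each ad."""

    def __init__(self, n_ads):
        self.alpha = np.ones(n_ads)   # 1 + number of observed clicks per ad
        self.beta = np.ones(n_ads)    # 1 + number of observed non-clicks per ad

    def select_ad(self):
        # Sample a plausible click rate for each ad, show the ad with the highest sample
        samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, ad, clicked):
        # Shift the distribution for this ad toward the observed outcome
        if clicked:
            self.alpha[ad] += 1
        else:
            self.beta[ad] += 1
```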
There are many real-world examples of companies using bandit algorithms for ad
placement optimization. For example, Google uses a bandit algorithm to optimize ad
placements on its search results pages. The algorithm selects the best ad to display based
on the user's search query and other contextual information. Similarly, Facebook uses a
bandit algorithm to optimize ad placements in its news feed. The algorithm selects the
best ad to display based on the user's interests and other contextual information. Other
companies that use bandit algorithms for ad placement optimization include Amazon,
Netflix, and Microsoft.
The MDP model simplifies an RL problem dramatically, as we do not need to take into
account all previous states or actions—we don’t need to have memory, we just need to
analyse the present situation. Hence, we always attempt to model a problem as (at least
approximately) a Markov decision process. The card game Blackjack (also known as
21) is an MDP because we can play the game successfully just by knowing our current
state (what cards we have, and the dealer’s one face-up card).
The Markov property states that: "The future is independent of the past, given the present."
Mathematically we can express this statement as:
P[S[t+1] | S[t]] = P[S[t+1] | S[1], S[2], ..., S[t]]
S[t] denotes the current state of the agent and S[t+1] denotes the next state. What this
equation means is that the transition from state S[t] to S[t+1] is entirely independent of
the past: the RHS of the equation equals the LHS whenever the system has the Markov
property. Intuitively, our current state already captures all the relevant information from
the past states.
To test your understanding of the Markov property, consider each control problem or
decision task in the following list and see if it has the Markov property or not:
Driving a car
Deciding whether to invest in a stock or not
Choosing a medical treatment for a patient
Diagnosing a patient’s illness
Predicting which team will win in a football game
Choosing the shortest route (by distance) to some destination
Aiming a gun to shoot a distant target
Driving a car can generally be considered to have the Markov property because you
don’t need to know what happened 10 minutes ago to be able to optimally drive your
car. You just need to know where everything is right now and where you want to go.
Deciding whether to invest in a stock or not does not meet the criteria of the Markov
property since you would want to know the past performance of the stock in order to
make a decision.
Choosing a medical treatment seems to have the Markov property because you don’t
need to know the biography of a person to choose a good treatment for what ails them
right now.
Predicting which football team will win does not have the Markov property, since,
like the stock example, you need to know the past performance of the football teams
to make a good prediction.
Choosing the shortest route to a destination has the Markov property because you just
need to know the distance to the destination for various routes, which doesn’t depend
on what happened yesterday.
Aiming a gun to shoot a distant target also has the Markov property since all you need
to know is where the target is, and perhaps current conditions like wind velocity and
the particulars of your gun. You don’t need to know the wind velocity of yesterday.
DeepMind’s deep Q-learning (or deep Q-network) algorithm learned to play Atari games
from just raw pixel data and the current score. Do Atari games have the Markov
property? Not exactly. In the game Pacman, if our state is the raw pixel data from our
current frame, we have no idea if the enemy a few tiles away is approaching us or moving
away from us, and that would strongly influence our choice of actions to take. This is
why DeepMind’s implementation actually feeds in the last four frames of gameplay,
effectively changing a non-MDP into an MDP. With the last four frames, the agent has
access to the direction and speed of all players.
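A small sketch of that frame-stacking idea using a fixed-length deque; the frame shape and
any preprocessing are placeholders:

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Keeps the last k frames so the stacked state carries motion information."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the buffer with copies of the first frame so the state always has k entries
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def step(self, new_frame):
        self.frames.append(new_frame)   # the oldest frame is dropped automatically
        return self.state()

    def state(self):
        # Shape (k, height, width): direction and speed can be inferred across frames
        return np.stack(list(self.frames), axis=0)
```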
The figure below gives a light-hearted example of a Markov decision process using all the
concepts we’ve discussed so far. You can see there is a three-element state space S =
{crying baby, sleeping baby, smiling baby}, and a two-element action space A = {feed,
don’t feed}. In addition, we have transition probabilities noted, which are maps from an
action to the probability of an outcome state (we’ll go over this again in the next section).
Of course, in real life, you as the agent have no idea what the transition probabilities are.
If you did, you would have a model of the environment. As you’ll learn later, sometimes
an agent does have access to a model of the environment, and sometimes not. In the cases
where the agent does not have access to the model, we may want our agent to learn a
model of the environment.
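To make the idea of transition probabilities concrete, here is a sketch of how such a model
could be written down for the baby example. The numbers are invented placeholders, not the
values from the figure:

```python
# Transition model: state -> action -> {next_state: probability}
# All probabilities below are hypothetical placeholders.
transitions = {
    "crying baby": {
        "feed": {"sleeping baby": 0.6, "smiling baby": 0.3, "crying baby": 0.1},
        "don't feed": {"crying baby": 0.9, "sleeping baby": 0.1},
    },
    "sleeping baby": {
        "feed": {"crying baby": 0.5, "sleeping baby": 0.5},
        "don't feed": {"sleeping baby": 0.8, "smiling baby": 0.2},
    },
    "smiling baby": {
        "feed": {"smiling baby": 0.7, "sleeping baby": 0.3},
        "don't feed": {"smiling baby": 0.5, "crying baby": 0.5},
    },
}

# Each action maps to a probability distribution over outcome states,
# so the probabilities for a given (state, action) pair sum to 1.
assert all(
    abs(sum(dist.values()) - 1.0) < 1e-9
    for actions in transitions.values()
    for dist in actions.values()
)
```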
Policy functions
In reinforcement learning, a policy function is a function that maps a state to a probability
distribution over the set of possible actions in that state. The policy function determines
the agent's behavior in the environment, by specifying which action to take in each state.
For example, consider the game of Blackjack. A policy function for this game might
specify that the dealer should always hit until they reach a card value of 17 or greater.
This is a simple fixed strategy that the dealer follows in every game.
In the case of the n-armed bandit problem, a policy function might be an epsilon-greedy
strategy, where the agent selects the action with the highest estimated reward with
probability 1-epsilon, and selects a random action with probability epsilon. This policy
function balances exploration (trying out new actions to learn more about their rewards)
with exploitation (selecting the action with the highest estimated reward).
In the context of ad placement optimization, a policy function might specify which ad to
display to a user based on their contextual information (such as the website they are
currently on). The policy function would map the user's context to a probability
distribution over the set of possible ads to display, and the ad with the highest probability
would be selected for display.
Overall, policy functions are a key concept in reinforcement learning, as they determine
the agent's behavior in the environment and ultimately determine the quality of the
learned policy.
Optimal policy
In reinforcement learning, the optimal policy is the strategy that maximizes rewards. It is
the policy that the agent should follow in order to achieve the highest possible reward in
the environment. The optimal policy is typically found by maximizing the expected
reward over all possible policies.
For example, in the game of Blackjack, the optimal policy for the dealer is to always hit
until they reach a card value of 17 or greater. This policy maximizes the dealer's chances
of winning the game.
In the case of the n-armed bandit problem, the optimal policy is to always select the
action with the highest true expected reward. Since the agent only has estimates of these
rewards, it can only approach this policy as its estimates improve.
In the context of ad placement optimization, the optimal policy would specify which ad
to display to a user based on their contextual information (such as the website they are
currently on) in order to maximize the probability of ad clicks and ultimately increase
revenue. The optimal policy would be learned by the bandit algorithm over time, as it
observes user interactions and updates its probability distribution over the set of possible
ads to display.
Value functions
In reinforcement learning, value functions are functions that map a state or a state-action
pair to the expected value (the expected cumulative reward) of being in that state or taking
that action in that state. The value function is used to evaluate the quality of a policy by
estimating the expected reward of following that policy.
For example, in the game of Blackjack, a state-value function might estimate the
expected reward of being in a particular state (such as having a hand of 15 and the dealer
showing a 10). This value function would be used to evaluate the quality of different
policies (such as hitting or standing) in that state.
In the case of the n-armed bandit problem, an action-value (Q) function might estimate
the expected reward of taking a particular action in a particular state. This value function
would be used to evaluate the quality of different policies (such as epsilon-greedy or
softmax action selection) in that state.
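For the bandit case, such an action-value (Q) function can be kept as a simple running
average of the rewards observed for each action, for example:

```python
import numpy as np

class ActionValues:
    """Action-value (Q) estimates for an n-armed bandit, kept as running averages."""

    def __init__(self, n_actions):
        self.q = np.zeros(n_actions)       # current value estimate per action
        self.counts = np.zeros(n_actions)  # how many times each action was taken

    def update(self, action, reward):
        # Incremental mean: Q_new = Q_old + (reward - Q_old) / n
        self.counts[action] += 1
        self.q[action] += (reward - self.q[action]) / self.counts[action]
```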
In the context of ad placement optimization, a value function might estimate the expected
reward of displaying a particular ad to a user based on their contextual information (such
as the website they are currently on). This value function would be used to evaluate the
quality of different policies (such as which ad to display) in that context.
Overall, value functions are a key concept in reinforcement learning, as they are used to
evaluate the quality of different policies and guide the agent's behavior in the
environment.