
REINFORCEMENT LEARNING

IV-I SEM
CSE (AI&ML)

UNIT-I-NOTES

SHORT-ANSWER QUESTIONS

1. Write about Reinforcement Learning?


Answer:
Reinforcement learning (RL) is a generic framework for representing and solving control
tasks, in which an agent learns, through trial and error and a reward signal, which actions
to take in an environment so as to maximize cumulative rewards. Within this framework
we are free to choose which algorithms we want to apply to a particular control task.

2. Why deep Reinforcement Learning?


Answer:
Deep reinforcement learning (DRL) is a subfield of machine learning that utilizes deep
learning models (i.e., neural networks) in reinforcement learning (RL) tasks. Deep
networks allow an RL agent to learn directly from raw, high-dimensional inputs (such as
the pixels of a game screen) rather than relying on hand-engineered features.

3. Write about String diagrams?


Answer:
Mathematicians have grown tired of traditional math notation with its huge array of
symbols, and within a particular branch of advanced mathematics called category theory,
they have developed a purely graphical language called string diagrams.
String diagrams look very similar to flowcharts and circuit diagrams, and they have a
fairly intuitive meaning, but they are just as rigorous and precise as traditional
mathematical notation, which is largely based on Greek and Latin symbols.

4. Define Exploration and exploitation?


Answer:
Exploration and exploitation are two fundamental concepts in the field of reinforcement
learning, which is a type of machine learning concerned with training agents to make
sequential decisions in order to maximize a cumulative reward signal.

Exploration: Exploration refers to the process of trying out different actions or choices
that an agent hasn't yet experienced in order to gather information about their potential
outcomes. It involves taking actions that might not be immediately optimal or known to
the agent, with the aim of discovering new knowledge about the environment and
improving the agent's understanding of the rewards associated with different actions.
Exploration is crucial for learning the underlying dynamics of the environment and
discovering optimal strategies over time.

Exploitation: Exploitation involves making decisions that are currently believed to be the
best based on the agent's current knowledge and experience. It refers to the act of
selecting actions that are expected to yield the highest immediate reward, based on the
agent's existing understanding of the environment. Exploitation is important for
maximizing short-term gains and making optimal decisions based on what is currently
known.

Balancing exploration and exploitation is a key challenge in reinforcement learning. Too
much exploration may lead to slower learning or unnecessary risk-taking, while too much
exploitation may result in suboptimal solutions as the agent fails to discover better
strategies. A successful reinforcement learning agent must strike a balance between
exploring new actions to gather information and exploiting its current knowledge to
maximize cumulative rewards over time. Various algorithms and strategies, such as ε-
greedy, Upper Confidence Bound (UCB), and Thompson Sampling, have been developed
to address this exploration-exploitation trade-off in different contexts.
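
As a concrete illustration, the UCB rule mentioned above can be written in a few lines of
Python. This is a minimal sketch, not a full implementation: the NumPy arrays counts and
means, holding each arm's pull count and running mean reward, and the time step t are
assumed bookkeeping maintained by the caller.

import numpy as np

def ucb1_select(counts, means, t, c=2.0):
    # Play every arm once before applying the confidence-bound formula.
    untried = np.where(counts == 0)[0]
    if untried.size > 0:
        return int(untried[0])
    # Score = estimated mean reward + exploration bonus; the bonus shrinks
    # as an arm is pulled more often, so exploration fades over time.
    # t is the total number of pulls made so far (starting at 1).
    bonus = np.sqrt(c * np.log(t) / counts)
    return int(np.argmax(means + bonus))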

5. Write about balancing of exploration versus exploitation in reinforcement learning?


Answer:
Balancing exploration versus exploitation is a key concept in reinforcement learning.
Essentially, it refers to the trade-off between trying out new actions (exploration) and
sticking with actions that have worked well in the past (exploitation).

In practice, this might look like a gambler trying out different slot machines to see which
one pays out the most (exploration), while also sticking with a machine that has paid out
well in the past (exploitation).

The proper balance of exploration and exploitation is important to maximizing rewards in
a reinforcement learning problem. There are two specific strategies for balancing these
two approaches: the greedy (or exploitation) method, which simply chooses the best
option based on what is already known, and the epsilon-greedy strategy, which includes
some amount of random exploration to discover new options.

6. Write about Contextual Bandits?


Answer:
Contextual Bandits are a specific class of problems within the broader framework of
reinforcement learning. They involve decision-making scenarios where an agent, known
as the learner, must choose actions from a set of available options (arms) in a sequential
manner to maximize its cumulative reward. However, unlike traditional multi-armed
bandit problems where the rewards for each arm are independent of context, Contextual
Bandits introduce the concept of context or context features.

In a Contextual Bandit setting, each decision point is associated with a context, which
represents relevant information or features that describe the current state or context of the
environment. The goal of the agent is to learn a policy that maps these context features to
the best action to take, considering the expected rewards associated with each action in
the given context. The agent's objective is to strike a balance between exploring different
actions to learn their associated rewards and exploiting its current knowledge to
maximize cumulative rewards.
7. Define Markov Property?
Answer:
A control task that exhibits the Markov property is said to be a Markov decision process
(MDP). With an MDP, the current state alone contains enough information to choose
optimal actions to maximize future rewards. Modeling a control task as an MDP is a key
concept in reinforcement learning.

LONG-ANSWER QUESTIONS
1. Write about Dynamic programming versus Monte Carlo method?
Answer:
Dynamic Programming (DP) and Monte Carlo (MC) are two fundamental approaches
used in reinforcement learning to solve Markov decision processes (MDPs) and optimize
decision-making strategies. Both methods aim to find optimal policies that maximize the
cumulative reward over time. However, they have different characteristics and are suited
for different types of problems. Let's compare Dynamic Programming and Monte Carlo
methods in reinforcement learning:

Dynamic Programming (DP):

Principle: DP is a model-based approach that involves solving a problem by breaking it
down into smaller sub-problems and recursively solving these sub-problems to obtain an
optimal solution.
Assumption: DP assumes complete knowledge of the MDP, including the transition
probabilities and reward functions.
Value Iteration and Policy Iteration: DP methods include techniques like Value
Iteration and Policy Iteration. Value Iteration updates the value function of each state
iteratively until convergence, and Policy Iteration alternates between policy evaluation
and policy improvement steps to find the optimal policy.
Convergence: DP methods guarantee convergence to the optimal solution under the
assumption of perfect knowledge of the MDP.
Computational Complexity: DP methods can become computationally expensive for
large state and action spaces, as they require computing and storing the values of all
states.
Bootstrapping: DP methods use bootstrapping, meaning they update the value estimates
of states based on the estimates of other states.
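
To make the Value Iteration idea above concrete, here is a minimal Python sketch. The
transition model P and reward table R are assumed to be given (since DP is model-based);
their exact structure here is an illustrative assumption, not a fixed convention.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    # P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
    # expected immediate reward; both are assumed known to the algorithm.
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: for each state, take the best action value
        # under the current value estimates.
        V_new = np.array([
            max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s])))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:   # stop once the backups converge
            return V_new
        V = V_new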

Monte Carlo (MC):

Principle: MC methods are model-free approaches that estimate the value function or the
policy by simulating episodes of interaction with the environment and averaging the
observed returns.
Assumption: MC methods do not require prior knowledge of the transition probabilities
and reward functions. They learn directly from interaction with the environment.
Policy Evaluation: MC methods estimate the value function by averaging the returns
obtained from different trajectories under a given policy.
Exploration: MC methods inherently explore by interacting with the environment and
collecting data.
Variance and Bias: MC methods can have higher variance due to the randomness of the
observed returns, but they are unbiased estimators of the value function.
Sample Efficiency: MC methods may require a larger number of episodes to converge to
the optimal solution compared to DP, especially in complex environments.
Batch and Online Learning: MC methods can be naturally applied in online learning
settings, where data is collected in an ongoing manner.
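
Correspondingly, a minimal sketch of first-visit Monte Carlo policy evaluation might look
as follows. The env object and its reset() and step() methods are an assumed Gym-style
interface; no model of the environment is used.

from collections import defaultdict

def mc_first_visit_evaluation(env, policy, n_episodes=1000, gamma=1.0):
    # Estimates V(s) under a fixed policy by averaging observed returns.
    returns = defaultdict(list)
    for _ in range(n_episodes):
        # Roll out one full episode by following the policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Record the index of the first visit to each state in this episode.
        first_visit = {}
        for i, (s, _) in enumerate(episode):
            first_visit.setdefault(s, i)
        # Walk backwards, accumulating the return G, and record first visits only.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:
                returns[s].append(G)
    # The value estimate is the average observed return from each state.
    return {s: sum(g) / len(g) for s, g in returns.items()}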

In summary, Dynamic Programming is more suitable for small and well-defined MDPs
where complete knowledge of the environment is available, while Monte Carlo methods
are effective when the MDP is unknown or too complex to model accurately. MC
methods excel in online learning scenarios where interaction with the environment is
feasible. Both DP and MC have their strengths and weaknesses, and the choice between
them depends on the characteristics of the problem and the available information about
the environment.

2. Explain about Reinforcement Learning Framework?


Answer:

Richard Bellman introduced dynamic programming as a general method of solving
certain kinds of control or decision problems, but it occupies an extreme end of the RL
continuum. Arguably, Bellman’s more important contribution was helping develop the
standard framework for RL problems. The RL framework is essentially the core set of
terms and concepts that every RL problem can be phrased in. This not only provides a
standardized language for communicating with other engineers and researchers, it also
forces us to formulate our problems in a way that is amenable to dynamic programming-
like problem decomposition, such that we can iteratively optimize over local sub-
problems and make progress toward achieving the global high-level objective.
Fortunately, it’s pretty simple too.

The input data is generated by the environment. In general, the environment of a RL (or
control) task is any dynamic process that produces data that is relevant to achieving our
objective. Although we use “environment” as a technical term, it’s not too far abstracted
from its everyday usage. As an instance of a very advanced RL algorithm yourself, you
are always in some environment, and your eyes and ears are constantly consuming
information produced by your environment so you can achieve your daily objectives.
Since the environment is a dynamic process (a function of time), it may be producing a
continuous stream of data of varied size and type. To make things algorithm-friendly, we
need to take this environment data and bundle it into discrete packets that we call the
state (of the environment) and then deliver it to our algorithm at each of its discrete time
steps. The state reflects our knowledge of the environment at some particular time, just as
a digital camera captures a discrete snapshot of a scene at some time (and produces a
consistently formatted image).
To summarize so far, using a data center cooling task as an example: we define an
objective function (minimize costs by optimizing temperature) that is a function of the
state (current costs, current temperature data) of the environment (the data center and
any related processes). The last part of our model is the
RL algorithm itself. This could be any parametric algorithm that can learn from data to
minimize or maximize some objective function by modifying its parameters. It does not
need to be a deep learning algorithm; RL is a field of its own, separate from the concerns
of any particular learning algorithm.

As we noted before, one of the key differences between RL (or control tasks generally)
and ordinary supervised learning is that in a control task the algorithm needs to make
decisions and take actions. These actions will have a causal effect on what happens in the
future. Taking an action is a keyword in the framework, and it means more or less what
you’d expect it to mean. However, every action taken is the result of analyzing the
current state of the environment and attempting to make the best decision based on that
information.

The last concept in the RL framework is that after each action is taken, the algorithm is
given a reward. The reward is a (local) signal of how well the learning algorithm is
performing at achieving the global objective. The reward can be a positive signal (i.e.,
doing well, keep it up) or a negative signal (i.e., don’t do that) even though we call both
situations a “reward.”

The reward signal is the only cue the learning algorithm has to go by as it updates itself in
hopes of performing better in the next state of the environment. In our data center
example, we might grant the algorithm a reward of +10 (an arbitrary value) whenever its
action reduces the error value. Or more reasonably, we might grant a reward proportional
to how much it decreases the error. If it increases the error, we would give it a negative
reward.

Lastly, let’s give our learning algorithm a fancier name, calling it the agent. The agent is
the action-taking or decision-making learning algorithm in any RL problem. We can put
this all together as shown in figure.
Figure: The standard framework for RL algorithms. The agent takes an action in the
environment, such as moving a chess piece, which then updates the state of the
environment. For every action it takes, it receives a reward (e.g., +1 for winning the
game, –1 for losing the game, 0 otherwise). The RL algorithm repeats this process with
the objective of maximizing rewards in the long term, and it eventually learns how the
environment works.

3. Write about what can I do with Reinforcement Learning?


Answer:
We began by reviewing the basics of ordinary supervised machine learning
algorithms, such as image classifiers, and although recent successes in supervised
learning are important and useful, supervised learning is not going to get us to artificial
general intelligence (AGI). We ultimately seek general-purpose learning machines that
can be applied to multiple problems with minimal to no supervision and whose repertoire
of skills can be transferred across domains. Large data-rich companies can gainfully
benefit from supervised approaches, but smaller companies and organizations may not
have the resources to exploit the power of machine learning. General-purpose learning
algorithms would level the playing field for everyone, and reinforcement learning is
currently the most promising approach toward such algorithms.

RL research and applications are still maturing, but there have been many exciting
developments in recent years. Google’s DeepMind research group has showcased some
impressive results and garnered international attention. The first was in 2013 with an
algorithm that could play a spectrum of Atari games at superhuman levels. Previous
attempts at creating agents to solve these games involved fine-tuning the underlying
algorithms to understand the specific rules of the game, often called feature engineering.
These feature engineering approaches can work well for a particular game, but they are
unable to transfer any knowledge or skills to a new game or domain. DeepMind’s deep
Q-network (DQN) algorithm was robust enough to work on seven games without any
game-specific tweaks. It had nothing more than the raw pixels from the screen as input
and was merely told to maximize the score, yet the algorithm learned how to play beyond
an expert human level.

More recently, DeepMind’s AlphaGo and AlphaZero algorithms beat the world’s best
players at the ancient Chinese game Go. Experts believed artificial intelligence would not
be able to play Go competitively for at least another decade because the game has
characteristics that algorithms typically don’t handle well. Players do not know the best
move to make at any given turn and only receive feedback for their actions at the end of
the game. Many high-level players saw themselves as artists rather than calculating
strategists and described winning moves as being beautiful or elegant. With over 10^170
legal board positions, brute force algorithms (which IBM’s Deep Blue used to win at
chess) were not feasible. AlphaGo managed this feat largely by playing simulated games
of Go millions of times and learning which actions maximized the rewards of playing the
game well. Similar to the Atari case, AlphaGo only had access to the same information a
human player would: where the pieces were on the board.
While algorithms that can play games better than humans are remarkable, the promise
and potential of RL goes far beyond making better game bots. DeepMind was able to
create a model to decrease Google's data center cooling costs by 40%, something we
explored earlier in these notes as an example. Autonomous vehicles use RL to learn
which series of actions (accelerating, turning, braking, signaling) leads to passengers
reaching their destinations on time and to learn how to avoid accidents. And researchers
are training robots to complete tasks, such as learning to run, without explicitly
programming complex motor skills.

Many of these examples are high stakes, like driving a car. You cannot just let a learning
machine learn how to drive a car by trial and error. Fortunately, there are an increasing
number of successful examples of letting learning machines loose in harmless simulators,
and once they have mastered the simulator, letting them try real hardware in the real
world. One instance that we will explore in this book is algorithmic trading. A substantial
fraction of all stock trading is executed by computers with little to no input from human
operators. Most of these algorithmic traders are wielded by huge hedge funds managing
billions of dollars. In the last few years, however, we’ve seen more and more interest by
individual traders in building trading algorithms. Indeed, Quantopian provides a platform
where individual users can write trading algorithms in Python and test them in a safe,
simulated environment. If the algorithms perform well, they can be used to trade real
money. Many traders have achieved relative success with simple heuristics and rule-
based algorithms. However, equity markets are dynamic and unpredictable, so a
continuously learning RL algorithm has the advantage of being able to adapt to changing
market conditions in real time.

One practical problem we’ll tackle early in this book is advertisement placement. Many
web businesses derive significant revenue from advertisements, and the revenue from ads
is often tied to the number of clicks those ads can garner. There is a big incentive to place
advertisements where they can maximize clicks. The only way to do this, however, is to
use knowledge about the users to display the most appropriate ads. We generally don’t
know what characteristics of the user are related to the right ad choices, but we can
employ RL techniques to make some headway. If we give an RL algorithm some
potentially useful information about the user (what we would call the environment, or
state of the environment) and tell it to maximize ad clicks, it will learn how to associate
its input data to its objective, and it will eventually learn which ads will produce the most
clicks from a particular user.

4. Explain about solving Multi-Arm Bandit problem?


Answer:
In reinforcement learning, the multi-arm bandit problem refers to a simplified scenario
where an agent must decide which action (or "arm") to choose in order to maximize its
cumulative reward over time. This problem serves as a fundamental building block in
reinforcement learning and decision-making, providing insights into the exploration-
exploitation trade-off.
In the context of reinforcement learning, the multi-arm bandit problem is often used as a
basic model to study how an agent should allocate its actions (or "pulls") among a set of
possible choices. The agent doesn't have complete knowledge about the rewards
associated with each action; instead, it needs to learn the rewards through interactions
over time. The agent's goal is to learn a policy that guides its decisions to achieve the
highest expected cumulative reward.

The multi-arm bandit problem in reinforcement learning is characterized by the following
components:

1. Actions (Arms): The set of choices or actions available to the agent. Each action
corresponds to an "arm" of the bandit.

2. Reward Distributions: Each action has an associated reward distribution (usually
assumed to be stationary over time). The agent does not know the true reward
distribution and needs to learn it from experience.

3. Exploration: The agent needs to balance exploration (trying out different actions to
gather information) and exploitation (choosing actions with the highest estimated
rewards based on available information).

4. Cumulative Reward: The agent aims to maximize its cumulative reward over a series
of interactions by making informed decisions about which action to select at each
step.
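
As a minimal illustration of these components, the following sketch defines a toy bandit
environment with Bernoulli reward distributions; the class name and the payout
probabilities are illustrative assumptions only.

import numpy as np

class BernoulliBandit:
    # Toy multi-arm bandit: each arm pays a reward of 1 with some unknown
    # probability and 0 otherwise. The true probabilities are hidden from the
    # agent, which must learn them by pulling arms and observing rewards.
    def __init__(self, probs):
        self.probs = np.asarray(probs)

    def pull(self, arm):
        return float(np.random.rand() < self.probs[arm])

# Example: a 3-armed bandit whose best arm (arm 2) pays out 70% of the time.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
reward = bandit.pull(2)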

There are several strategies and algorithms used to address the multi-arm bandit problem,
each with its own approach to balancing exploration and exploitation. Here are some
common solutions for the multi-arm bandit problem:

Epsilon-Greedy Algorithm:
This is a simple and widely used approach. The agent selects the arm with the highest
estimated reward with probability (1-ε), and explores a random arm with probability ε.
Over time, the algorithm tends to converge to the optimal arm while still exploring other
arms.

EXP3 Algorithm:
EXP3 (Exponential-weight algorithm for Exploration and Exploitation) maintains a
weight for each arm, selects arms with probabilities proportional to the exponentiated
weights, and updates the chosen arm's weight using an importance-weighted estimate of
its reward. Unlike UCB, it does not assume the rewards come from fixed distributions, so
it also works in adversarial settings.

Softmax solution:
The softmax solution is a popular approach to solving the multi-arm bandit problem that
involves using the softmax function to model the agent's exploration and exploitation
behavior. This approach is also known as the "Boltzmann exploration" strategy or the
"soft-max" algorithm. It provides a probabilistic way to select arms based on their
estimated rewards.
Contextual Bandits:
In this extension of the problem, additional contextual information about the arms is
provided. The agent learns a mapping from context to actions, allowing for more
personalized and efficient decision-making.

Deep Reinforcement Learning:


When dealing with a large number of arms or complex reward structures, deep
reinforcement learning techniques can be used. Deep neural networks can approximate
the reward distribution, and techniques like Q-learning or policy gradients can be
employed.

5. Explain about epsilon-Greedy strategy in Reinforcement learning, and give an example of how it may be applied in practice?
Answer:
Epsilon-greedy strategy
The epsilon-greedy strategy is a common approach to balancing exploration and
exploitation in reinforcement learning. Essentially, it involves choosing the best known
option most of the time (the "greedy" part), but also occasionally trying out a random
option (the "epsilon" part) to see if there might be a better choice that hasn't been
discovered yet.

In practice, this might look like a gambler who usually plays the slot machine that has
paid out the most in the past (the "greedy" part), but occasionally tries out a different
machine just to see if it might pay out even more (the "epsilon" part).

The epsilon-greedy strategy is often used in reinforcement learning because it strikes a
good balance between exploiting what is already known and exploring new options. By
occasionally trying out new actions, the agent can discover better options that might not
have been found otherwise. At the same time, by mostly sticking with what is already
known to work, the agent can avoid wasting time on actions that are unlikely to pay off.

Implementing the epsilon-greedy strategy in practice:

Implementing the epsilon-greedy strategy is relatively straightforward. Here
are the basic steps:

1. Choose a value for epsilon (the "exploration rate"). This value determines how often
the agent will choose a random action instead of the best known action. A common value
for epsilon is 0.1, which means that the agent will choose a random action 10% of the
time.

2. At each time step, generate a random number between 0 and 1. If the random number
is less than epsilon, choose a random action. Otherwise, choose the action with the
highest estimated value (the "greedy" part).

3. Update the estimated value of the chosen action based on the reward received.
4. Repeat steps 2-3 for a fixed number of time steps or until convergence.
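
A minimal Python sketch of these steps might look as follows, assuming a bandit object
whose pull(arm) method returns a numeric reward (for example, a toy Bernoulli bandit
like the one sketched in the previous answer).

import numpy as np

def run_epsilon_greedy(bandit, n_arms, n_steps=1000, epsilon=0.1):
    # Step 1: epsilon is the exploration rate chosen up front.
    Q = np.zeros(n_arms)        # estimated value of each arm
    counts = np.zeros(n_arms)   # number of times each arm has been pulled
    total_reward = 0.0
    for _ in range(n_steps):
        # Step 2: explore with probability epsilon, otherwise act greedily.
        if np.random.rand() < epsilon:
            arm = np.random.randint(n_arms)   # explore: random arm
        else:
            arm = int(np.argmax(Q))           # exploit: best known arm
        reward = bandit.pull(arm)
        total_reward += reward
        # Step 3: incremental (running-mean) update of the estimated value.
        counts[arm] += 1
        Q[arm] += (reward - Q[arm]) / counts[arm]
    # Step 4: the loop above repeats for a fixed number of time steps.
    return Q, total_reward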

In practice, the epsilon-greedy strategy can be used to solve a variety of problems,
including the classic "multi-armed bandit" problem. In this problem, the agent must
choose between several slot machines (or "arms") with unknown payout probabilities.
The goal is to maximize the total payout over a fixed number of time steps.

To solve this problem using the epsilon-greedy strategy, the agent would start by
randomly choosing one of the slot machines. After each play, the agent would update the
estimated value of the chosen machine based on the reward received. Then, at each
subsequent play, the agent would choose the machine with the highest estimated value
with probability 1-epsilon, and a random machine with probability epsilon. Over time,
the agent would converge on the machine with the highest payout probability, while still
occasionally trying out other machines to see if they might be better.

The epsilon-greedy strategy can be used to solve a multi-armed bandit problem:

Here is a more detailed example of how the epsilon-greedy strategy can be used to solve
a multi-armed bandit problem:

Suppose we have a gambler who is trying to maximize their winnings by playing one of
several slot machines. Each machine has an unknown payout probability, and the gambler
must choose which machine to play at each time step. The goal is to maximize the total
payout over a fixed number of time steps.

To solve this problem using the epsilon-greedy strategy, the gambler would start by
randomly choosing one of the slot machines. After each play, the gambler would
update the estimated value of the chosen machine based on the reward received. For
example, if the gambler played machine A and won 10 coins, they might update the
estimated value of machine A to be 10.

Then, at each subsequent play, the gambler would choose the machine with the highest
estimated value with probability 1-epsilon, and a random machine with probability
epsilon. For example, if the gambler had estimated values of 10 for machine A, 5 for
machine B, and 3 for machine C, and epsilon was set to 0.1, the gambler would choose
machine A with probability 0.9, machine B with probability 0.05, and machine C with
probability 0.05 (the exploration probability of 0.1 split evenly between the two
non-greedy machines).

Over time, the gambler would converge on the machine with the highest payout
probability, while still occasionally trying out other machines to see if they might be
better. For example, if machine A had a higher payout probability than the other
machines, the gambler would eventually start playing machine A more often than the
other machines, but would still occasionally try out the other machines to see if their
estimated values had changed.
The epsilon-greedy strategy is a simple but effective way to balance exploration and
exploitation in the multi-armed bandit problem. By occasionally trying out new machines,
the gambler can discover better options that might not have been found otherwise. At the
same time, by mostly sticking with the machines that have already paid out well, the gambler
can avoid wasting time on machines that are unlikely to pay off.

6. Explain about soft-max selection strategy in Reinforcement learning, and give an example of how it may be applied in practice?
Answer:
Soft-max selection policy:
The soft-max selection policy is another approach to balancing exploration and
exploitation in reinforcement learning. Instead of choosing actions randomly (as in the
epsilon-greedy strategy), the soft-max selection policy assigns probabilities to each
possible action based on its estimated value. Actions with higher estimated values are
assigned higher probabilities, while actions with lower estimated values are assigned
lower probabilities.

In practice, this might look like a gambler who assigns probabilities to each slot machine
based on how much they have paid out in the past. The machine that has paid out the
most might be assigned a probability of 0.8, while the machine that has paid out the least
might be assigned a probability of 0.1. The gambler would then choose a machine at
random based on these probabilities.

The soft-max selection policy is often used in reinforcement learning because it provides
a more nuanced approach to balancing exploration and exploitation. By assigning
probabilities to each action, the agent can explore new options while still favoring actions
that are known to be good. Additionally, the soft-max selection policy can provide
information about the relative values of different actions, which can be useful in
situations where the agent needs to make more complex decisions.

Implementation of the soft-max selection policy in practice:

Here is an explanation of how to implement the soft-max selection policy in practice,
along with an example of how it might be used to solve a medical treatment problem:

To implement the soft-max selection policy, we first need to estimate the value of each
possible action. This can be done using a variety of methods, such as Monte Carlo
simulation or temporal difference learning. Once we have estimated the value of each
action, we can use the soft-max function to assign probabilities to each action. The soft-
max function is defined as:

Pr(A) = exp(Q(A) / τ) / Σ[exp(Q(a) / τ)]


where Q(A) is the estimated value of action A, τ is a "temperature" parameter that
controls the degree of exploration, and Σ[exp(Q(a) / τ)] is the sum of the exponential
values of all possible actions.
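
A minimal Python sketch of soft-max action selection, assuming the estimated action
values are held in an array Q, might look like this:

import numpy as np

def softmax_probs(Q, tau=1.0):
    # Convert estimated action values Q into selection probabilities.
    # Subtracting max(Q) before exponentiating keeps the values numerically
    # stable without changing the resulting probabilities.
    prefs = np.exp((np.asarray(Q) - np.max(Q)) / tau)
    return prefs / prefs.sum()

def softmax_select(Q, tau=1.0):
    probs = softmax_probs(Q, tau)
    return int(np.random.choice(len(probs), p=probs))

# With Q = [0.8, 0.6] and tau = 1.0, the probabilities come out to roughly
# 0.55 and 0.45; a smaller tau makes the choice greedier, a larger tau more random.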

Soft-max selection policy to solve a medical treatment problem:


To use the soft-max selection policy to solve a medical treatment problem, we might start
by considering a doctor who is trying to choose the best treatment for a patient with a
heart attack. The doctor has 10 possible treatments to choose from, each with different
efficacies and risk profiles. The goal is to choose the treatment that is most likely to save
the patient's life.

To solve this problem using the soft-max selection policy, the doctor would start by
estimating the value of each treatment based on previous outcomes. For example, if
treatment A had saved 8 out of 10 patients, while treatment B had saved 6 out of 10
patients, the doctor might estimate the value of treatment A to be higher than the value of
treatment B.

Then, at each subsequent patient encounter, the doctor would use the soft-max function to
assign probabilities to each treatment based on their estimated values. For example, if
treatment A had an estimated value of 0.8 and treatment B had an estimated value of 0.6,
the doctor might assign a probability of about 0.55 to treatment A and a probability of
about 0.45 to treatment B (assuming a temperature parameter of 1).

Over time, the doctor would converge on the treatment with the highest estimated value,
while still occasionally trying out other treatments to see if they might be better. By using
the soft-max selection policy, the doctor can explore different treatment options while
still favoring treatments that are known to be effective. Additionally, the soft-max
function provides information about the relative values of different treatments, which can
be useful in situations where the doctor needs to make more complex decisions.

Soft-max selection policy to solve a multi-armed bandit problem:


To use the soft-max selection policy to solve a multi-armed bandit problem, we might
start by considering a gambler who is trying to maximize their winnings by playing one
of several slot machines. Each machine has an unknown payout probability, and the
gambler must choose which machine to play at each time step. The goal is to maximize
the total payout over a fixed number of time steps.

To solve this problem using the soft-max selection policy, the gambler would start by
estimating the value of each machine based on previous outcomes. For example, if
machine A had paid out 8 coins on average, while machine B had paid out 6 coins on
average, the gambler might estimate the value of machine A to be higher than the value
of machine B.

Then, at each subsequent play, the gambler would use the soft-max function to assign
probabilities to each machine based on their estimated values. For example, if machine A
had an estimated value of 0.8 and machine B had an estimated value of 0.6, the gambler
might assign a probability of about 0.55 to machine A and a probability of about 0.45 to machine B
(assuming a temperature parameter of 1).

Over time, the gambler would converge on the machine with the highest payout probability,
while still occasionally trying out other machines to see if they might be better. By using the
softmax selection policy, the gambler can explore different machines while still favoring
machines that are known to pay out well. Additionally, the soft-max function provides
information about the relative values of different machines, which can be useful in situations
where the gambler needs to make more complex decisions.

7. Write about applying bandits to optimize ad placements?


Answer:
In the context of ad placement optimization, the problem is to maximize the probability
that a user will click on an ad. This is where bandit algorithms come in.

The basic idea is to treat the ad placement problem as a multi-armed bandit problem,
where each "arm" corresponds to a different ad placement strategy. The algorithm then
tries different strategies and learns which ones are most effective over time.

One popular approach is to use softmax action selection, which computes probabilities
for each arm based on their current action values. The algorithm then chooses an arm
randomly but weighted by these probabilities. This approach tends to converge faster on
the maximal mean reward compared to the epsilon-greedy method.

Of course, in practice, there are many additional complexities to consider, such as the fact
that the best ad placement may depend on the context (e.g. which website the user is on).
This is where contextual bandits come in, which add a new layer of complexity to the
problem.

There are many companies that use bandit algorithms for ad placement optimization. One
example is Google, whose AdWords (now Google Ads) platform uses bandit-style
algorithms to optimize ad placement on its search engine results pages.

Another example is Microsoft, which has used a contextual bandit algorithm called
LinUCB to optimize ad placement on its Bing search engine.

In addition, many other companies in the advertising industry, including Facebook and
Amazon, use bandit algorithms to optimize ad placement.

Overall, bandit algorithms have become an important tool for companies looking to
maximize the effectiveness of their ad placements and improve their return on
investment.

8. Write briefly about the following


a. Compare PyTorch with Numpy
b. Building models with PyTorch
Answer:

Compare PyTorch with Numpy:

PyTorch and NumPy are both popular libraries for numerical computing in Python, but
they have some key differences. Here are a few:

1. Tensors vs arrays: In PyTorch, the basic data structure is the tensor, which is similar to
a NumPy array but with some additional features. Tensors can be used to represent multi-
dimensional arrays of numerical data, and they can be easily moved between CPU and
GPU memory for efficient computation.

2. Automatic differentiation: PyTorch includes a powerful automatic differentiation
engine, which allows you to easily compute gradients of functions with respect to their
inputs. This is useful for many machine learning tasks, such as training neural networks.
NumPy does not include automatic differentiation, although there are other libraries (such
as Autograd) that can be used for this purpose.

3. GPU acceleration: PyTorch includes built-in support for GPU acceleration, which can
greatly speed up numerical computations. NumPy can also be used with GPUs, but it
requires additional libraries (such as CuPy) to do so.

4. Dynamic computation graphs: PyTorch uses dynamic computation graphs, which
means that the graph is constructed on-the-fly as the program runs. This allows for more
flexibility in building complex models, but it can also make debugging more difficult.
NumPy, by contrast, does not build a computation graph at all: operations are executed
immediately and no record of the computation is kept.

Overall, PyTorch and NumPy are both powerful libraries for numerical computing, but
they have different strengths and weaknesses depending on the task at hand. PyTorch is
particularly well-suited for machine learning tasks that require automatic differentiation
and GPU acceleration, while NumPy is more general-purpose and can be used for a wide
range of numerical computations.
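
The following minimal sketch illustrates the contrast, assuming both libraries are
installed: eager NumPy computation next to a PyTorch tensor that tracks gradients and
can be moved to a GPU when one is available.

import numpy as np
import torch

# NumPy: plain arrays, computed eagerly, no gradient tracking.
x_np = np.array([1.0, 2.0, 3.0])
y_np = (x_np ** 2).sum()

# PyTorch: tensors can track gradients and can be moved to a GPU.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()            # automatic differentiation
print(x.grad)           # tensor([2., 4., 6.]), i.e. dy/dx = 2x

device = "cuda" if torch.cuda.is_available() else "cpu"
x_gpu = x.detach().to(device)   # same data, moved to the GPU when present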

Building models with PyTorch

PyTorch is a popular library for building deep learning models in Python. Here are some
of the key steps involved in building models with PyTorch:

1. Define the model architecture: The first step in building a PyTorch model is to define
its architecture. This involves specifying the number and type of layers in the model, as
well as any activation functions or other operations that should be applied to the data.
2. Define the loss function: The next step is to define the loss function, which is used to
measure how well the model is performing. This will depend on the specific task you are
trying to solve (e.g. classification, regression, etc.).

3. Define the optimizer: The optimizer is used to update the model parameters during
training in order to minimize the loss function. PyTorch includes a variety of built-in
optimizers, such as stochastic gradient descent (SGD), Adam, and RMSprop.

4. Train the model: Once the model architecture, loss function, and optimizer have been
defined, you can begin training the model. This involves feeding training data into the
model, computing the loss, and using the optimizer to update the model parameters.

5. Evaluate the model: After the model has been trained, you can evaluate its
performance on a separate validation or test dataset. This will give you an idea of how
well the model is likely to perform on new, unseen data.

6. Fine-tune the model: Depending on the results of the evaluation, you may need to fine-
tune the model by adjusting its architecture, hyperparameters, or other settings. This
process may involve repeating steps 1-5 multiple times until you achieve satisfactory
results.

Overall, building models with PyTorch involves a combination of defining the model
architecture, loss function, and optimizer, as well as training and evaluating the model to
achieve the best possible performance on your specific task.
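
A minimal sketch tying these steps together might look as follows; the layer sizes,
learning rate, and random stand-in data are illustrative assumptions rather than a
prescription for any particular task.

import torch
import torch.nn as nn

# 1. Model architecture (the layer sizes here are arbitrary placeholders).
model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

# 2. Loss function for a two-class classification task, and 3. an optimizer.
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 4. Training loop; random tensors stand in for a real dataset.
for step in range(100):
    inputs = torch.randn(16, 4)
    labels = torch.randint(0, 2, (16,))
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()

# 5. Evaluation without gradient tracking.
model.eval()
with torch.no_grad():
    predictions = model(torch.randn(8, 4)).argmax(dim=1)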

9. Explain about solving contextual bandits?


Answer:
Contextual bandits are a type of reinforcement learning problem where an agent must
select actions based on contextual information in order to maximize some reward signal.
In the case of ad placement optimization, the contextual information might include the
website the user is currently on, their search history, or other demographic information.
The goal is to select the best ad to display to the user based on this contextual
information, in order to maximize the probability of ad clicks and ultimately increase
revenue. Contextual bandits are a more complex version of the classic multi-armed bandit
problem, where the rewards for each action are fixed and do not depend on any context.

Solving contextual bandits involves finding the optimal policy for selecting actions in a
multi-armed bandit problem where the rewards for each action may depend on some
context or state information.

In the case of contextual bandits for advertisement placement, the goal is to find the best
ad to place for a given user context (such as the website they are currently on). This
involves selecting an action (i.e. an ad) that maximizes the expected reward, taking into
account the context information.
One approach to solving contextual bandits is to use a model-based approach, where the
agent learns a model of the reward function based on the context information and uses
this model to select actions. Another approach is to use a model-free approach, where the
agent learns the optimal policy directly from experience, without explicitly modeling the
reward function.

In the case of model-free approaches, one popular way to solve contextual bandits is to
use a neural network to estimate the expected reward for each action given the context
information. The network is trained using a variant of stochastic gradient descent, where
the loss is computed from the outcome observed for the chosen action in that context (for
example, the negative log-likelihood of the observed click, or the squared error on the
observed reward).
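
As a minimal sketch of such a setup, simplified to an epsilon-greedy choice and a
squared-error loss on the observed reward, with illustrative network sizes and the context
assumed to be a 1-D float tensor:

import torch
import torch.nn as nn

n_arms, context_dim = 10, 10    # e.g. 10 candidate ads, 10 context features
# The network maps a context vector to one predicted reward per arm (ad).
net = nn.Sequential(nn.Linear(context_dim, 64), nn.ReLU(), nn.Linear(64, n_arms))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

def choose_arm(context, epsilon=0.1):
    # Epsilon-greedy over the predicted rewards for this particular context.
    if torch.rand(1).item() < epsilon:
        return int(torch.randint(n_arms, (1,)).item())
    with torch.no_grad():
        return int(net(context).argmax().item())

def update(context, arm, reward):
    # Regress the chosen arm's prediction toward the reward actually observed
    # (e.g. 1 for a click, 0 otherwise) using stochastic gradient descent.
    prediction = net(context)[arm]
    loss = loss_fn(prediction, torch.tensor(float(reward)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()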

Overall, solving contextual bandits involves finding the best strategy for selecting actions
in a multi-armed bandit problem where the rewards depend on some context or state
information. This can be done using a variety of model-based or model-free approaches,
depending on the specific problem and available data.

The downside to using softmax action selection for ad placement optimization is that it
can be overly confident in its predictions, leading to suboptimal results. This is because
softmax action selection assigns probabilities to each action based on their expected
rewards, and the action with the highest probability is selected. However, if the expected
rewards are uncertain or noisy, this can lead to the algorithm being overly confident in its
predictions and selecting suboptimal actions.

The bandit algorithm works in the context of ad placement optimization by selecting the
best ad to display to a user based on their context (such as the website they are currently
on). The algorithm learns from past user interactions to improve its predictions over time.
Specifically, the algorithm maintains a probability distribution over the possible ads to
display, and selects an ad to display based on this distribution. After the ad is displayed,
the algorithm updates its probability distribution based on the observed reward (i.e.
whether the user clicked on the ad or not).

There are many real-world examples of companies using bandit algorithms for ad
placement optimization. For example, Google uses a bandit algorithm to optimize ad
placements on its search results pages. The algorithm selects the best ad to display based
on the user's search query and other contextual information. Similarly, Facebook uses a
bandit algorithm to optimize ad placements in its news feed. The algorithm selects the
best ad to display based on the user's interests and other contextual information. Other
companies that use bandit algorithms for ad placement optimization include Amazon,
Netflix, and Microsoft.

10. Write about the Markov property with examples?


Answer:
The Markov property is a fundamental concept in reinforcement learning (RL) that
describes the property of a state in a Markov decision process (MDP). An MDP is a
mathematical framework used to model sequential decision-making problems, such as
those encountered in RL. The Markov property states that the future state of the
environment (or system) depends only on the current state and the action taken, and not
on the sequence of states and actions that led to the current state. A game (or any other
control task) that exhibits the Markov property is said to be a Markov decision process
(MDP). With an MDP, the current state alone contains enough information to choose
optimal actions to maximize future rewards. Modelling a control task as an MDP is a key
concept in reinforcement learning.

The MDP model simplifies an RL problem dramatically, as we do not need to take into
account all previous states or actions—we don’t need to have memory, we just need to
analyse the present situation. Hence, we always attempt to model a problem as (at least
approximately) a Markov decision process. The card game Blackjack (also known as
21) is an MDP because we can play the game successfully just by knowing our current
state (what cards we have, and the dealer’s one face-up card).

The Markov property states that "the future is independent of the past, given the present."
Mathematically, we can express this statement as:

P[ S[t+1] | S[t] ] = P[ S[t+1] | S[1], S[2], ..., S[t] ]

S[t] denotes the current state of the agent and S[t+1] denotes the next state. What this
equation means is that the transition from state S[t] to S[t+1] is entirely independent of
the past: the right-hand side equals the left-hand side whenever the system has the
Markov property. Intuitively, our current state already captures all the relevant
information from the past states.

To test your understanding of the Markov property, consider each control problem or
decision task in the following list and see if it has the Markov property or not:
• Driving a car
• Deciding whether to invest in a stock or not
• Choosing a medical treatment for a patient
• Diagnosing a patient's illness
• Predicting which team will win in a football game
• Choosing the shortest route (by distance) to some destination
• Aiming a gun to shoot a distant target

Here are our answers and brief explanations:

• Driving a car can generally be considered to have the Markov property because you
don't need to know what happened 10 minutes ago to be able to optimally drive your
car. You just need to know where everything is right now and where you want to go.

• Deciding whether to invest in a stock or not does not meet the criteria of the Markov
property since you would want to know the past performance of the stock in order to
make a decision.

• Choosing a medical treatment seems to have the Markov property because you don't
need to know the biography of a person to choose a good treatment for what ails them
right now.

• In contrast, diagnosing (rather than treating) would definitely require knowledge of
past states. It is often very important to know the historical course of a patient's
symptoms in order to make a diagnosis.

• Predicting which football team will win does not have the Markov property, since,
like the stock example, you need to know the past performance of the football teams
to make a good prediction.

• Choosing the shortest route to a destination has the Markov property because you just
need to know the distance to the destination for various routes, which doesn't depend
on what happened yesterday.

• Aiming a gun to shoot a distant target also has the Markov property since all you need
to know is where the target is, and perhaps current conditions like wind velocity and
the particulars of your gun. You don't need to know the wind velocity of yesterday.

DeepMind’s deep Q-learning (or deep Q-network) algorithm learned to play Atari games
from just raw pixel data and the current score. Do Atari games have the Markov
property? Not exactly. In the game Pacman, if our state is the raw pixel data from our
current frame, we have no idea if the enemy a few tiles away is approaching us or moving
away from us, and that would strongly influence our choice of actions to take. This is
why DeepMind’s implementation actually feeds in the last four frames of gameplay,
effectively changing a non-MDP into an MDP. With the last four frames, the agent has
access to the direction and speed of all players.

The figure below gives a light-hearted example of a Markov decision process using all the
concepts we’ve discussed so far. You can see there is a three-element state space S =
{crying baby, sleeping baby, smiling baby}, and a two-element action space A = {feed,
don’t feed}. In addition, we have transition probabilities noted, which are maps from an
action to the probability of an outcome state (we’ll go over this again in the next section).
Of course, in real life, you as the agent have no idea what the transition probabilities are.
If you did, you would have a model of the environment. As you’ll learn later, sometimes
an agent does have access to a model of the environment, and sometimes not. In the cases
where the agent does not have access to the model, we may want our agent to learn a
model of the environment.

11. Write briefly about the following


a. Policy functions
b. Optimal policy
c. Value functions
Answer:

Policy functions
In reinforcement learning, a policy function is a function that maps a state to a probability
distribution over the set of possible actions in that state. The policy function determines
the agent's behavior in the environment, by specifying which action to take in each state.

For example, consider the game of Blackjack. A policy function for this game might
specify that the dealer should always hit until they reach a card value of 17 or greater.
This is a simple fixed strategy that the dealer follows in every game.

In the case of the n-armed bandit problem, a policy function might be an epsilon-greedy
strategy, where the agent selects the action with the highest estimated reward with
probability 1-epsilon, and selects a random action with probability epsilon. This policy
function balances exploration (trying out new actions to learn more about their rewards)
with exploitation (selecting the action with the highest estimated reward).

In the context of ad placement optimization, a policy function might specify which ad to
display to a user based on their contextual information (such as the website they are
currently on). The policy function would map the user's context to a probability
distribution over the set of possible ads to display, and the ad with the highest probability
would be selected for display.

Overall, policy functions are a key concept in reinforcement learning, as they determine
the agent's behavior in the environment and ultimately determine the quality of the
learned policy.

Optimal policy
In reinforcement learning, the optimal policy is the strategy that maximizes rewards. It is
the policy that the agent should follow in order to achieve the highest possible reward in
the environment. The optimal policy is typically found by maximizing the expected
reward over all possible policies.

For example, in the game of Blackjack, the optimal policy for the dealer is to always hit
until they reach a card value of 17 or greater. This policy maximizes the dealer's chances
of winning the game.

In the case of the n-armed bandit problem, the optimal policy is to always select the
action with the highest estimated reward. This policy maximizes the expected reward
over all possible actions.

In the context of ad placement optimization, the optimal policy would specify which ad
to display to a user based on their contextual information (such as the website they are
currently on) in order to maximize the probability of ad clicks and ultimately increase
revenue. The optimal policy would be learned by the bandit algorithm over time, as it
observes user interactions and updates its probability distribution over the set of possible
ads to display.

Overall, the optimal policy is a key concept in reinforcement learning, as it determines
the best possible behavior for the agent in the environment.

Value functions

In reinforcement learning, value functions are functions that map a state or a state-action
pair to the expected value (the expected reward) of being in some state or taking some
action in some state. The value function is used to evaluate the quality of a policy, by
estimating the expected reward of following that policy.

For example, in the game of Blackjack, a state-value function might estimate the
expected reward of being in a particular state (such as having a hand of 15 and the dealer
showing a 10). This value function would be used to evaluate the quality of different
policies (such as hitting or standing) in that state.
In the case of the n-armed bandit problem, an action-value (Q) function might estimate
the expected reward of taking a particular action in a particular state. This value function
would be used to evaluate the quality of different policies (such as epsilon-greedy or
softmax action selection) in that state.
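
In the bandit case, such an action-value estimate is usually maintained as a running mean
of the observed rewards; a minimal sketch of that incremental update, with hypothetical
variable names, is:

def update_action_value(q, k, reward):
    # Running-mean update of an action-value estimate: q is the current
    # estimate, k is how many times the action has been taken so far, and
    # reward is the newly observed reward for taking it once more.
    #     Q_{k+1} = Q_k + (reward - Q_k) / (k + 1)
    return q + (reward - q) / (k + 1)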

In the context of ad placement optimization, a value function might estimate the expected
reward of displaying a particular ad to a user based on their contextual information (such
as the website they are currently on). This value function would be used to evaluate the
quality of different policies (such as which ad to display) in that context.

Overall, value functions are a key concept in reinforcement learning, as they are used to
evaluate the quality of different policies and guide the agent's behavior in the
environment.
