
BCS602 | MACHINE LEARNING| VTU Belagavi.

Module-5

Chapter – 01 - Clustering Algorithms

Introduction to Clustering Approaches

 Cluster analysis is the fundamental task of unsupervised learning, which involves exploring the given dataset.

 Cluster analysis is a technique of partitioning a collection of unlabelled objects that have many attributes into meaningful disjoint groups, or clusters.

 This is done using a trial-and-error approach, as there is no supervisor available as in classification.

 The characteristic of clustering is that objects within a cluster are similar to each other, while they differ significantly from objects in other clusters.

 The input for cluster analysis is examples or samples, also known as objects, data points or data instances. All these terms are the same and are used interchangeably in this chapter. Samples or objects with no labels associated with them are called unlabelled.

 The output is the set of clusters (or groups) of similar data, if such groups exist in the input.

 For example, Figure 13.1(a) shows data points (samples) with two features, shown with different shading, and Figure 13.1(b) shows manually drawn ellipses indicating the clusters formed.


Visual identification of clusters in this case is easy, as the examples have only two features. But when examples have more features, say 100, clustering cannot be done manually, and automatic clustering algorithms are required.

Automating the clustering process is also desirable because such tasks are difficult, and often practically impossible, for humans. All clusters are represented by centroids.

Example: If the input examples or data points are (3, 3), (2, 6) and (7, 9), then the centroid is given as ((3 + 2 + 7)/3, (3 + 6 + 9)/3) = (4, 6).

The clusters should not overlap, and every cluster should represent only one class. Therefore, clustering algorithms use a trial-and-error method to form clusters that can then be converted to labels.

Difference between Clustering & Classification


Applications of Clustering

Challenges of Clustering Algorithms

High-Dimensional Data

o As the number of features increases, clustering becomes difficult.

Scalability Issue

o Some algorithms perform well for small datasets but fail for large-scale data.

Unit Inconsistency

o Different measurement units (e.g., kg vs. pounds) can create problems.

Proximity Measure Design

o Choosing an appropriate distance metric is crucial for accurate clustering.


Advantages and Disadvantages of Clustering Algorithms

Proximity Measures

Proximity measures determine similarity or dissimilarity among objects. Distance measures (dissimilarity) indicate how different objects are, while similarity measures indicate how alike objects are. The basic property: more distance → less similarity, and vice versa.

Properties of Distance Measures (Metric Conditions)
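For reference, the standard metric conditions on a distance measure d(·, ·) are:

d(x, y) ≥ 0                          (non-negativity)
d(x, y) = 0 if and only if x = y     (identity)
d(x, y) = d(y, x)                    (symmetry)
d(x, z) ≤ d(x, y) + d(y, z)          (triangle inequality)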


Types of Distance Measures Based on Data Types

Quantitative Variables
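As a minimal illustration of distance measures for quantitative variables, the commonly used Euclidean, Manhattan (city-block) and Minkowski distances can be computed as in the following sketch (plain Python; the sample vectors are illustrative):

def euclidean(x, y):
    # square root of the sum of squared differences
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    # sum of absolute differences (city-block distance)
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p=3):
    # generalises Manhattan (p = 1) and Euclidean (p = 2)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (3, 3), (7, 9)           # two sample points with two features each
print(euclidean(x, y))          # ≈ 7.211
print(manhattan(x, y))          # 10
print(minkowski(x, y, p=2))     # same as Euclidean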


Binary Attributes

Categorical Variables

Distance is 1 if different, 0 if same.

Example: Gender (Male, Female) → Distance = 1


Ordinal Variables

Vector-Based Distance Measures (For Text & Documents)

Cosine Similarity

o Measures the angle between two vectors.
o Formula: cos θ = (x · y) / (‖x‖ ‖y‖), i.e., the dot product of the two vectors divided by the product of their lengths.
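A minimal sketch of cosine similarity for two document vectors (plain Python; the term-frequency vectors are illustrative):

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x) ** 0.5
    norm_y = sum(b * b for b in y) ** 0.5
    return dot / (norm_x * norm_y)

doc1 = [2, 1, 0, 3]   # term frequencies for document 1 (illustrative)
doc2 = [1, 1, 1, 2]   # term frequencies for document 2
print(cosine_similarity(doc1, doc2))   # values near 1 mean very similar direction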


Distance Measures

Hierarchical Clustering Algorithms

Overview

 Produces a nested partition of objects with hierarchical relationships.


 Represented using a dendrogram.
 Two main categories: Agglomerative and Divisive methods.

Types of Hierarchical Clustering

1. Agglomerative Methods (Bottom-Up)


o Each sample starts as an individual cluster.
o Clusters are merged iteratively until one cluster remains.
o Once a cluster is formed, it cannot be undone (irreversible).
2. Divisive Methods (Top-Down)
o Starts with a single cluster containing all data points.
o Splits iteratively into smaller clusters.
o Continues until each sample becomes its own cluster.


Agglomerative Clustering Techniques

Single Linkage (MIN Algorithm)

o Merges clusters based on the smallest distance between two points from different clusters.
o Related to the Minimum Spanning Tree (MST).

Complete Linkage (MAX or Clique Algorithm)


Average Linkage Algorithm
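As a minimal sketch of agglomerative clustering (assuming SciPy is available), the single, complete and average linkage criteria can be compared on a toy dataset:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 1]])  # illustrative points

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # merge history (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into at most 3 clusters
    print(method, labels)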

Mean-Shift Clustering Algorithm

 Non-parametric and hierarchical clustering technique.


 Also known as mode-seeking or sliding window algorithm.
 No prior knowledge of cluster count or shape required.
 Moves towards high-density regions in data using a kernel function (e.g., Gaussian
window).
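A minimal usage sketch of mean-shift (assuming scikit-learn is installed); the bandwidth here is estimated from the data rather than hand-tuned:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.vstack([np.random.randn(50, 2),
               np.random.randn(50, 2) + [6, 6]])     # two illustrative blobs

bandwidth = estimate_bandwidth(X, quantile=0.2)      # the single window parameter
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)
print("clusters found:", len(np.unique(labels)))
print("cluster centres:\n", ms.cluster_centers_)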


Advantages of Mean-Shift Clustering

No model assumptions

Suitable for non-convex cluster shapes

Only one window parameter, the bandwidth, is required

Robust to noise

No issues of local minima or premature termination

Disadvantages of Mean-Shift Clustering

Selecting the bandwidth is a challenging task. If it is too large, many clusters are missed; if it is too small, many points are missed and convergence becomes a problem.

The number of clusters cannot be specified, and the user has no control over this parameter.

Partitional Clustering Algorithm

 k-means is a widely used partitional clustering algorithm.


 The user specifies k, the number of clusters.
 Assumes non-overlapping clusters.
 Works well for circular or spherical clusters.

Process of k-means Algorithm

1. Initialization
o Select k initial cluster centers (randomly or using prior knowledge).
o Normalize data for better performance.
2. Assignment of Data Points
o Assign each data point to the nearest centroid based on Euclidean distance.
3. Update Centroids


o Compute the mean vector of assigned points to update cluster centroids.


o Repeat this process until no changes occur in cluster assignments.
4. Termination
o The process stops when cluster assignments remain unchanged.
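A minimal NumPy sketch of the above procedure (the data, k and the stopping rule are illustrative; handling of empty clusters is omitted):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # 1. initialization
    for _ in range(iters):
        # 2. assignment: each point goes to the nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 3. update: the new centroid is the mean of the points assigned to the cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                    # 4. termination
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.3, 4.9], [9.0, 9.1], [8.8, 9.0]])
labels, centroids = kmeans(X, k=3)
print(labels)
print(centroids)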

Mathematical Optimization
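The objective that k-means minimizes is the within-cluster sum of squares (WCSS), in standard form:

J = Σ_{j=1..k} Σ_{x ∈ C_j} ‖x − μ_j‖², where μ_j is the centroid of cluster C_j.

Each assignment and update step reduces (or leaves unchanged) this objective, which is why the algorithm converges.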

Advantages

1. Simple and easy to implement.


2. Efficient for small to medium datasets.


Disadvantages

1. Sensitive to initialization – different initial points may lead to different results.


2. Time-consuming for large datasets – requires multiple iterations.

Choosing the Value of k

 No fixed rule for selecting k.


 Use Elbow Method:
o Run k-means with different values of k.
o Plot Within Cluster Sum of Squares (WCSS) vs. k.
o The optimal k is at the "elbow" where the curve flattens.
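A minimal sketch of the elbow method (assuming scikit-learn and matplotlib are available; the data is illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.randn(300, 2)                  # illustrative data
ks = range(1, 10)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                 # within-cluster sum of squares for this k

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS")
plt.title("Elbow method: pick k where the curve flattens")
plt.show()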

Computational Complexity

O(nkId), where:

o n = number of samples
o k = number of clusters
o I = number of iterations
o d = number of attributes

Density-based Methods

 DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-


based clustering algorithm.
 Clusters are dense regions of data points separated by areas of low density (noise).
 Works well for arbitrary-shaped clusters and datasets with noise.


Uses two parameters:

1. ε (epsilon) – Neighborhood radius.


2. m (minPts) – Minimum number of points within ε to form a cluster.

Types of Points in DBSCAN

1. Core Point
o A point with at least m points in its ε-neighborhood.
2. Border Point
o Has fewer than m points in its ε-neighborhood but is adjacent to a core point.
3. Noise Point (Outlier)
o Neither a core point nor a border point.

Density Connectivity Measures

1. Direct Density Reachability


o Point X is directly reachable from Y if:
 X is in the ε-neighborhood of Y.
 Y is a core point.
2. Densely Reachable


o X is densely reachable from Y if there exists a chain of core points linking them.
3. Density Connected
o X and Y are density connected if they are both densely reachable from a
common core point Z.
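A minimal usage sketch of DBSCAN (assuming scikit-learn is available); eps and min_samples correspond to ε and m above, and the values are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.vstack([np.random.randn(100, 2),              # one dense blob
               np.random.randn(100, 2) + [8, 8],     # a second dense blob
               np.random.uniform(-4, 12, (20, 2))])  # scattered noise points

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
labels = db.labels_                                  # label -1 marks noise (outliers)
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))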

Advantages of DBSCAN

1. Can detect arbitrary-shaped clusters.


2. Robust to noise and outliers.
3. Does not require specifying the number of clusters in advance (unlike k-means).

Disadvantages of DBSCAN

1. Sensitive to ε and m parameters – Poor parameter choice can affect results.


2. Fails in datasets with varying density – A single ε may not work for all clusters.
3. Computationally expensive for high-dimensional data.

Grid-based Approach

 Grid-based clustering partitions space into a grid structure and fits data into cells for
clustering.
 Suitable for high-dimensional data.
 Uses subspace clustering, dense cells, and monotonicity property.

Concepts

Subspace Clustering

o Clusters are formed using a subset of features (dimensions) rather than all
attributes.
o Useful for high-dimensional data like gene expression analysis.


o CLIQUE (Clustering in Quest) is a widely used grid-based subspace clustering


algorithm.

Concept of Dense Cells

o CLIQUE partitions dimensions into intervals (cells).


o A cell is dense if its data point density exceeds a threshold.
o Dense cells are merged to form clusters.
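A minimal sketch of the dense-cell idea behind grid-based methods such as CLIQUE: partition each dimension into intervals, count points per cell, and keep cells whose count exceeds a density threshold (grid size and threshold are illustrative; the full CLIQUE algorithm additionally works subspace by subspace and merges adjacent dense cells):

import numpy as np
from collections import Counter

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + [5, 5]])
grid_size = 1.0      # width of each interval (cell) in every dimension
threshold = 15       # minimum number of points for a cell to count as dense

cells = Counter(tuple(np.floor(p / grid_size).astype(int)) for p in X)
dense_cells = {cell for cell, count in cells.items() if count >= threshold}
print("dense cells:", sorted(dense_cells))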

Monotonicity Property


o Uses anti-monotonicity (Apriori property):


 If a k-dimensional cell is dense, then all (k-1) dimensional projections must
also be dense.
 If a lower-dimensional cell is not dense, then higher-dimensional cells
containing it are also not dense.
o Similar to association rule mining in frequent pattern mining.

Advantages of CLIQUE

1. Insensitive to input order of objects.


2. No assumptions about data distribution.
3. Finds high-density clusters in subspaces of high-dimensional data.

Disadvantage of CLIQUE

 Tuning grid parameters (grid size, density threshold) is difficult.


 Finding the optimal threshold to classify a cell as dense is challenging.


Chapter – 02 - Reinforcement Learning

Overview of Reinforcement Learning

What is Reinforcement Learning?

 Reinforcement Learning (RL) is a machine learning paradigm that mimics how


humans and animals learn through experience.
 Humans interact with the environment, receive feedback (rewards or penalties), and
adjust their behavior accordingly.
 Example: A child touching fire learns to avoid it after experiencing pain (negative
reinforcement).

How RL Works in Machines

 RL simulates real-world scenarios for a computer program (agent) to learn by trial and
error.
 The agent executes actions, receives positive or negative rewards, and optimizes its
future actions based on these experiences.

Types of Reinforcement Learning

1. Positive Reinforcement Learning


o Rewards encourage good behavior (reinforce correct actions).
o Example: A robot gets +10 points for reaching a goal successfully.
o Effect: Increases the likelihood of repeating the rewarded action.
2. Negative Reinforcement Learning
o Negative rewards discourage unwanted actions.
o Example: A game agent loses -10 points for stepping into a danger zone.
o Effect: Helps the agent learn to avoid negative outcomes.


Characteristics of RL

 Sequential Decision-Making: The agent makes a series of decisions to maximize total


rewards.
 Trial and Error Learning: The agent learns by exploring different actions and their
consequences.
 No Supervised Labels: Unlike supervised learning, RL does not require labeled data; it
learns from experience.

Applications of Reinforcement Learning

 Robotics: Teaching robots to walk, grasp objects, or perform complex tasks.


 Gaming: AI agents in chess, Go, and video games (e.g., AlphaGo, OpenAI Five).
 Autonomous Vehicles: Self-driving cars learn optimal driving strategies.
 Finance: AI-based trading strategies for stock markets.
 Healthcare: Personalized treatment plans based on patient responses.

Scope of Reinforcement Learning

Reinforcement Learning (RL) is well-suited for decision-making problems in dynamic and


uncertain environments. It excels in cases where an agent must learn through trial and error
and optimize its actions based on delayed rewards.

Situations Where RL Can Be Used

Pathfinding and Navigation

o Consider a grid-based game where a robot must navigate from a starting node (E) to a goal
node (G) by choosing the optimal path.
o RL can learn the best route by exploring different paths and receiving feedback on their
efficiency.


o In obstacle-based games, RL can identify safe paths while avoiding dangerous zones.

Dynamic Decision-Making with Uncertainty

o RL is useful in environments where not all information is known upfront.


o It is not suitable for tasks like object detection, where a classifier with complete
labeled data performs better.

Characteristics of Reinforcement Learning

1. Sequential Decision-Making
o In RL, decisions are made in steps, and each step influences future choices.
o Example: In a maze game, a wrong turn can lead to failure.
2. Delayed Feedback
o Rewards are not always immediate; the agent may need to take multiple steps before
receiving feedback.
3. Interdependence of Actions
o Each action affects the next set of choices, meaning an incorrect move can have
long-term consequences.
4. Time-Related Decisions
o Actions are taken in a specific sequence over time, affecting the final
outcome.

Challenges in Reinforcement Learning

Reward Design

o Setting the right reward values is crucial. Incorrectly designed rewards may lead the agent
to learn undesired behavior.

Absence of a Fixed Model


o Some environments, like chess, have fixed rules, but many real-world problems lack
predefined models.
o Example: Training a self-driving car requires simulations to generate experiences.

Partial Observability

o Some environments, like weather prediction, involve uncertainty because complete state
information is unavailable.

High Computational Complexity

o Games like Go involve a huge state space, making RL training time-consuming.


o More possible actions → More training time needed.

Applications of Reinforcement Learning

1. Industrial Automation
o Optimizing robot movements for efficiency.
2. Resource Management
o Allocating resources in data centers and cloud computing.
3. Traffic Light Control
o Reducing congestion by adjusting signal timings dynamically.
4. Personalized Recommendation Systems
o Used in news feeds, e-commerce, and streaming services (e.g., Netflix
recommendations).
5. Online Advertisement Bidding
o Optimizing ad placements for maximum engagement.
6. Autonomous Vehicles
o RL helps in training self-driving cars to navigate safely.
7. Game AI (Chess, Go, Dota 2, etc.)
o AI models like AlphaGo use RL to master complex games.
8. DeepMind Applications


o AI systems that generate programs, images, and optimize machine learning


models.

Reinforcement Learning as Machine Learning

Reinforcement Learning (RL) is a distinct branch of machine learning that differs significantly
from supervised learning.

While supervised learning depends on labeled data, reinforcement learning learns through
interaction with the environment, making decisions based on trial and error.

Why RL Is Necessary?

Some tasks cannot be solved using supervised learning due to the absence of a labeled training
dataset. For example:

 Chess & Go: There is no dataset with all possible game moves and their
outcomes. RL allows the agent to explore and improve over time.
 Autonomous Driving: The car must learn from real-world experiences rather than
relying on a fixed dataset.

Challenges in Reinforcement Learning Compared to Supervised Learning

 More complex decision-making since every action affects future outcomes.


 Longer training times due to trial-and-error learning.
 Delayed rewards, making it difficult to attribute success or failure to a specific action.


Differences between Supervised Learning and Reinforcement Learning

Components of Reinforcement Learning

Reinforcement Learning (RL) is based on an agent interacting with an environment to learn an


optimal strategy through trial and error.

Basic Components of RL


1. Agent – The decision-maker (e.g., a robot, self-driving car, AI player in a game).


2. Environment – The external world where the agent interacts (e.g., a game board, real-
world traffic).
3. State (S) – A representation of the environment at a specific time.
4. Actions (A) – The possible choices available to the agent.
5. Rewards (R) – The feedback signal received by the agent for taking an action.
6. Policy (π) – The agent’s strategy for selecting actions based on states.
7. Episodes – The sequence of states, actions, and rewards from the start state to the goal
state.

Types of RL Problems
Learning Problems

 Unknown environment – The agent learns by trial and error.


 Goal – Improve the policy through interaction.
 Example – A robot navigating through an unknown maze.

Planning Problems

 Known environment – The agent can compute and improve the policy using a model.
 Example – Chess AI that plans its moves based on game rules.

Environment and Agent

 The environment contains all elements the agent interacts with, including
obstacles, rewards, and state transitions.
 The agent makes decisions and performs actions to maximize rewards.

Example

In self-driving cars,


 The environment includes roads, traffic, and signals.


 The agent is the AI system making driving decisions.

States and Actions

 State (S) – Represents the current situation.


 Action (A) – Causes a transition from one state to another.

Example (Navigation)

In a grid-based game, states represent positions (A, B, C, etc.), and actions are movements (UP,
DOWN, LEFT, RIGHT).

Types of States

1. Start State – Where the agent begins.


2. Goal State – The target state with the highest reward.
3. Non-terminal States – Intermediate steps between start and goal.


Types of Episodes

 Episodic – Has a definite start and goal state (e.g., solving a maze).
 Continuous – No fixed goal state; the task continues indefinitely (e.g., stock
trading).

Policies in RL

A policy (π) is the strategy used by the agent to choose actions.

Types of Policies

Choosing the Best Policy

 The optimal policy is the one that maximizes cumulative expected rewards.

Rewards in RL

 Immediate Reward (r) – The instant feedback for an action.


 Total Reward (G) – The sum of all rewards collected during an episode.
 Long-term Reward – The cumulative future reward.


Discount Factor (γ)
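The discount factor γ (0 ≤ γ ≤ 1) weights future rewards less than immediate ones. The standard discounted return is:

G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … = Σ_{k=0}^{∞} γ^k · r_{t+k+1}

A γ close to 0 makes the agent short-sighted (it values immediate rewards), while a γ close to 1 makes it value long-term rewards.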

RL Algorithm Categories

 Model-Based RL – Uses a predefined model (e.g., Chess AI).


 Model-Free RL – Learns by trial and error (e.g., a robot navigating an unknown
environment).

Markov Decision Process

A Markov Chain is a stochastic process that satisfies the Markov property.

It consists of a sequence of random variables where the probability of transitioning to the next
state depends only on the current state and not on the past states.


Example: University Transition

Consider two universities:

 80% of students from University A move to University B for a master's degree, while
20% remain in University A.
 60% of students from University B move to University A, while 40% remain in
University B.

This can be represented as a Markov Chain, where:

 States represent the universities.


 Edges denote the probability of transitioning between states.

With the states ordered as (A, B), the transition matrix T at time t is defined as:

T = [ [0.2, 0.8],
      [0.6, 0.4] ]

Each row represents a probability distribution over the next state, meaning the sum of elements in each row equals 1.

Probability Prediction

Let the initial distribution over the states be a row vector p₀. The state distribution after one time step is p₁ = p₀ · T.

After two time steps: p₂ = p₁ · T = p₀ · T².


The system stabilizes over time, reflecting the equilibrium distribution.
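A minimal NumPy sketch of this example (the initial distribution below is an assumption chosen for illustration):

import numpy as np

T = np.array([[0.2, 0.8],    # from A: 20% stay in A, 80% move to B
              [0.6, 0.4]])   # from B: 60% move to A, 40% stay in B

p = np.array([1.0, 0.0])     # assumed start: everyone in University A
for step in range(1, 11):
    p = p @ T                # distribution after one more time step
    print(step, np.round(p, 4))
# The printed distribution settles near the equilibrium [3/7, 4/7] ≈ [0.4286, 0.5714].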

Markov Decision Process (MDP)

An MDP extends a Markov Chain by incorporating rewards. It consists of:

1. Set of states
2. Set of actions
3. Transition probability function
4. Reward function
5. Policy
6. Value function

Markov Assumption

The Markov property states that the probability of reaching the next state s_{t+1} and receiving reward r_{t+1} depends only on the current state s_t and action a_t:

P(s_{t+1}, r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, …, s_0) = P(s_{t+1}, r_{t+1} | s_t, a_t)

MDP Process

1. Observe the current state s.

2. Choose an action a.
3. Receive a reward r.
4. Move to the next state s′.
5. Repeat to maximize cumulative rewards.

State Transition Probability


The probability of moving from state s to state s′ after taking action a is given by:

P(s′ | s, a) = Pr(s_{t+1} = s′ | s_t = s, a_t = a)

This forms a state transition matrix, where each row represents the transition probabilities from one state to all others.

Expected Reward

The expected reward for taking action a in state s is given by:

R(s, a) = E[ r_{t+1} | s_t = s, a_t = a ]

Training and Testing of RL Systems

Once an MDP is modeled, the system undergoes:

1. Training: The agent repeatedly interacts with the environment, adjusting


parameters based on rewards.
2. Inference: A trained model is deployed to make decisions in real-time.
3. Retraining: When the environment changes, the model is retrained to adapt and
improve performance.

Goal of MDP

The agent's objective is to maximize total accumulated rewards over time by following an
optimal policy.


Multi-Arm Bandit Problem and Reinforcement Problem Types

Reinforcement Learning Overview

Reinforcement learning (RL) uses trial and error to learn a series of actions that maximize the
total reward. RL consists of two fundamental sub-problems:

Prediction (Value Estimation):

o The goal is to predict the total reward (return), also known as policy evaluation or value
estimation.
o This requires the formulation of a function called the state-value function.
o The estimation of the state-value function can be performed using Temporal
Difference (TD) Learning.

Policy Improvement:

o The objective is to determine actions that maximize returns.


o This process is known as policy improvement.
o Both prediction and policy improvement can be combined into policy iteration, where these
steps are used alternately to find an optimal policy.

Multi-Arm Bandit Problem

A commonly encountered problem in reinforcement learning is the multi-arm bandit problem


(or N-arm bandit problem).

Consider a hypothetical casino with a robotic arm that activates a 5-armed slot machine. When a
lever is pulled, the machine returns a reward within a specified range (e.g., $1 to
$10).

The challenge is that each arm provides rewards randomly within this range.


Objective:

Given a limited number of attempts, the goal is to maximize the total reward by selecting the best
lever.

A logical approach is to determine which lever has the highest average reward and use it repeatedly.

Formalization:

Given k attempts on an N-arm slot machine, with rewards r₁, r₂, …, r_k received for an action a, the expected reward (action-value function) is the average reward:

Q(a) = (r₁ + r₂ + … + r_k) / k

The best action is defined as:

a* = argmax_a Q(a)

This indicates the action that returns the highest average reward and is used as an indicator of action quality.

Example:

If a slot machine arm is chosen five times and returns rewards r₁, …, r₅, the quality of this action is the average Q(a) = (r₁ + r₂ + r₃ + r₄ + r₅) / 5.


Exploration vs Exploitation and Selection Policies

In reinforcement learning, an agent must decide how to select actions:

Exploration:

o Tries all actions, even if they lead to sub-optimal decisions.


o Useful in games where exploring different actions provides better long-term
rewards.
o Risky but informative.

Exploitation:

o Uses the current best-known action repeatedly.


o Focuses on short-term gains.
o Simple but often sub-optimal.

A balance between exploration and exploitation is crucial for optimal decision-making.

Selection Policies

Greedy Method

 Picks the best-known action at any given time.


 Based solely on exploitation.
 Risk: It may miss out on exploring better options.

ε-Greedy Method

 Balances exploration and exploitation.


 With probability ε, the agent explores a random action.
 With probability 1 - ε, it selects the best-known action.
 ε ranges from 0 to 1 (e.g., ε = 0.1 means a 10% chance of exploration).
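A minimal sketch of ε-greedy action selection on a 5-armed bandit (the hidden payout means and all parameter values are assumptions made for illustration):

import random

true_means = [3.0, 5.5, 2.0, 7.0, 4.0]   # hidden average payout of each arm (assumed)
k = len(true_means)
Q = [0.0] * k                            # estimated action values
N = [0] * k                              # number of times each arm was pulled
epsilon = 0.1                            # 10% chance of exploration

for t in range(1000):
    if random.random() < epsilon:
        a = random.randrange(k)                       # explore: random arm
    else:
        a = max(range(k), key=lambda i: Q[i])         # exploit: best-known arm
    reward = random.gauss(true_means[a], 1.0)         # pull the lever
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]                    # incremental average update

print("estimated Q:", [round(q, 2) for q in Q])
print("best arm found:", Q.index(max(Q)))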


Reinforcement Learning Agent Types

An RL agent can be classified into different approaches based on how it learns:

1. Value-Based Approaches
o Optimize the value function v(s), which represents the maximum expected future reward from a given state.
o Uses discount factors to prioritize future rewards.
2. Policy-Based Approaches
o Focus on finding the optimal policy π, a function that maps states to actions.
o Rather than estimating values, it directly learns which action to take.
3. Actor-Critic Methods (Hybrid Approaches)
o Combine value-based and policy-based methods.
o The actor updates the policy, while the critic evaluates it.
4. Model-Based Approaches
o Create a model of the environment (e.g., using Markov Decision Processes
(MDPs)).
o Use simulations to plan the best actions.
5. Model-Free Approaches


o No predefined model of the environment.


o Use methods like Temporal Differencing (TD) Learning and Monte Carlo
methods to estimate values from experience.

Reinforcement Algorithm Selection

The choice of a reinforcement learning algorithm depends on factors such as:

 Availability of models
 Nature of updates (incremental vs. batch learning)
 Exploration vs. exploitation trade-offs
 Computational efficiency

Model-based Learning

Passive Learning refers to a model-based environment, where the environment is known. This
means that for any given state, the next state and action probability distribution are known.

Markov Decision Process (MDP) and Dynamic Programming are powerful tools for solving
reinforcement learning problems in this context.


The mathematical foundation for passive learning is provided by MDP. These model-based reinforcement learning problems can be solved using dynamic programming after constructing the model with MDP.

The primary objective in reinforcement learning is to take an action a that transitions the system
from the current state to the end state while maximizing rewards. These rewards can be positive
or negative.

The goal is to maximize expected rewards by choosing the optimal policy π*, i.e., the policy that maximizes the expected return from every state s at time t.

Policy and Value Functions

An agent in reinforcement learning has multiple courses of action for a given state. The way the
agent behaves is determined by its policy.

A policy is a distribution over all possible actions with probabilities assigned to each action.

Different actions yield different rewards. To quantify and compare these rewards, we use value
functions.

Value Function Notation

A value function summarizes possible future scenarios by averaging expected returns under a
given policy π.

It is a prediction of future rewards: it computes the expected sum of (discounted) future rewards for a given state s under policy π:

v_π(s) = E_π[ r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … | s_t = s ]


where v(s) represents the quality of the state based on a long-term strategy.

Example: If we have two states with values 0.2 and 0.9, the state with value 0.9 is a better state to be in.

Value functions can be of two types:

 State-Value Function (for a state)


 State-Action Function (for a state-action pair)

State-Value Function

Denoted as v(s), the state-value function of an MDP is the expected return from state s under a policy π:

v_π(s) = E_π[ G_t | s_t = s ]

This function accumulates all expected rewards, potentially discounted over time, and helps determine the goodness of a state.

The optimal state-value function is given by:

v*(s) = max_π v_π(s)

Action-Value Function (Q-Function)

Apart from v(s), another function called the Q-function is used. This function returns a real
value indicating the total expected reward when an agent:

1. Starts in state s
2. Takes action a
3. Follows a policy π afterward
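In standard notation, this is q_π(s, a) = E_π[ G_t | s_t = s, a_t = a ].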


Bellman Equation

Dynamic programming methods require a recursive formulation of the problem. The recursive formulation of the state-value function is given by the Bellman equation:

v_π(s) = Σ_a π(a|s) Σ_{s′} P(s′|s, a) [ R(s, a, s′) + γ·v_π(s′) ]

Solving Reinforcement Problems

There are two main algorithms for solving reinforcement learning problems using conventional
methods:

1. Value Iteration
2. Policy Iteration

Value Iteration

Value iteration estimates v(s) iteratively by repeatedly applying the Bellman optimality backup:

v(s) ← max_a Σ_{s′} P(s′|s, a) [ R(s, a, s′) + γ·v(s′) ]


Algorithm

1. Initialize v(s) arbitrarily (e.g., all zeros).


2. Iterate until convergence:
o For each state s, update v(s) using the Bellman equation.
o Repeat until changes are negligible.
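A minimal value-iteration sketch on a tiny illustrative MDP (all states, actions, transition probabilities and rewards below are assumptions made up for the example):

import numpy as np

n_states, n_actions = 4, 2
gamma, theta = 0.9, 1e-6

# P[a][s] is a list of (probability, next_state, reward) triples for action a in state s.
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 2, 1.0)], 2: [(1.0, 3, 5.0)], 3: [(1.0, 3, 0.0)]},
    1: {0: [(1.0, 2, 2.0)], 1: [(1.0, 3, 3.0)], 2: [(1.0, 0, 0.0)], 3: [(1.0, 3, 0.0)]},
}

V = np.zeros(n_states)                       # 1. initialize v(s) arbitrarily (zeros)
while True:
    delta = 0.0
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[a][s])
             for a in range(n_actions)]      # Bellman backup for every action
        best = max(q)
        delta = max(delta, abs(best - V[s]))
        V[s] = best                          # 2. update v(s)
    if delta < theta:                        # stop when changes are negligible
        break

print("state values:", np.round(V, 3))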

Policy Iteration

Policy iteration consists of two main steps:

1. Policy Evaluation
2. Policy Improvement

Policy Evaluation

Initially, for a given policy π, the algorithm starts with v(s) = 0 (no reward). The Bellman
equation is used to obtain v(s), and the process continues iteratively until the optimal v(s) is
found.

Policy Improvement

The policy improvement process is performed as follows:

1. Evaluate the current policy using policy evaluation.


2. Solve the Bellman equation for the current policy to obtain v(s).
3. Improve the policy by applying the greedy approach to maximize expected
rewards.
4. Repeat the process until the policy converges to the optimal policy.

Algorithm

1. Start with an arbitrary policy π.


2. Perform policy evaluation using Bellman’s equation.


3. Improve the policy greedily.


4. Repeat until convergence.
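A minimal policy-iteration sketch on the same kind of tiny illustrative MDP (again, all numbers are assumptions for the example):

import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
# P[a][s] is a list of (probability, next_state, reward) triples for action a in state s.
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 2, 1.0)], 2: [(1.0, 3, 5.0)], 3: [(1.0, 3, 0.0)]},
    1: {0: [(1.0, 2, 2.0)], 1: [(1.0, 3, 3.0)], 2: [(1.0, 0, 0.0)], 3: [(1.0, 3, 0.0)]},
}

def q_value(V, s, a):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[a][s])

policy = np.zeros(n_states, dtype=int)            # 1. start with an arbitrary policy
while True:
    # 2. policy evaluation: estimate v(s) for the current policy by repeated sweeps
    V = np.zeros(n_states)
    for _ in range(500):
        V = np.array([q_value(V, s, policy[s]) for s in range(n_states)])
    # 3. policy improvement: act greedily with respect to v(s)
    new_policy = np.array([max(range(n_actions), key=lambda a: q_value(V, s, a))
                           for s in range(n_states)])
    if np.array_equal(new_policy, policy):        # 4. repeat until the policy is stable
        break
    policy = new_policy

print("optimal policy:", policy, "state values:", np.round(V, 3))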

Model Free Methods

Model-free methods do not require complete knowledge of the environment. Instead, they learn
through experience and interaction with the environment.

The reward determination in model-free methods can be categorized into three formulations:

1. Episodic Formulation: Rewards are assigned based on the outcome of an entire episode. For
example, if a game is won, all actions in the episode receive a positive reward (+1). If lost, all
actions receive a negative reward (-1). However, this approach may unfairly penalize or
reward intermediate actions.
2. Continuous Formulation: Rewards are determined immediately after an action. An
example is the multi-armed bandit problem, where an immediate reward between $1
- $10 can be given after each action.
3. Discounted Returns: Long-term rewards are considered using a discount factor. This method
is often used in reinforcement learning algorithms.

Model-free methods primarily utilize the following techniques:

 Monte Carlo (MC) Methods


 Temporal Difference (TD) Learning

Monte-Carlo Methods

Monte Carlo (MC) methods do not assume any predefined model, making them purely
experience-driven. This approach is analogous to how humans and animals learn from interactions
with their environment.


Characteristics of Monte Carlo Methods:

 Experience is divided into episodes, where each episode is a sequence of states from a
starting state to a goal state.
 Episodes must terminate; regardless of the starting point, an episode must reach an
endpoint.
 Value-action functions are computed only after the completion of an episode, making
MC an incremental method.
 MC methods compute rewards at the end of an episode to estimate maximum expected
future rewards.
 Empirical mean is used instead of expected return; the total return over multiple
episodes is averaged.
 Due to the non-stationary nature of environments, value functions are computed for a
fixed policy and revised using dynamic programming.

Monte Carlo Mean Value Computation:

The mean value of a state is calculated as the empirical mean of the returns observed from that state:

V(s) = (G₁ + G₂ + … + G_N) / N, where G_i is the return obtained in the i-th episode that visits s.

Incremental Monte Carlo Update:

The value function is updated incrementally using the following formula:

V(s) ← V(s) + (1/N(s)) · (G_t − V(s)), or with a constant step size α, V(s) ← V(s) + α · (G_t − V(s)).


Temporal Difference (TD) Learning

Temporal Difference (TD) Learning is an alternative to Monte Carlo methods. It is also a model-
free technique that learns from experience and interaction with the environment.

Characteristics of TD Learning:

 Bootstrapping Method: Updates are based on the current estimate and future reward.
 Incremental Updates: Unlike MC, which waits until the end of an episode, TD
updates values at each step.
 More Efficient: TD can learn before an episode ends, making it more sample-
efficient than MC methods.
 Used for Non-Stationary Problems: Suitable for environments where conditions
change over time.
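For reference, the standard TD(0) update moves the current estimate toward the observed reward plus the discounted estimate of the next state:

V(s_t) ← V(s_t) + α · [ r_{t+1} + γ·V(s_{t+1}) − V(s_t) ]

The bracketed quantity is called the TD error.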

Differences between Monte Carlo and TD Learning


Eligibility Traces and TD(λ)

TD Learning can be accelerated using eligibility traces, which allow updates to be spread over
multiple states. This leads to a family of algorithms called TD(λ), where λ is the decay parameter
(0 ≤ λ ≤ 1):

 λ = 0: Only the previous prediction is updated.


 λ = 1: All previous predictions are updated.

By incorporating eligibility traces, TD(λ) provides an alternative short-term memory mechanism


to enhance learning efficiency.


Q-Learning

Q-Learning Algorithm

1. Initialize Q-table:
o Create a table Q(s,a) with states s and actions a.
o Initialize Q-values with random or zero values.
2. Set parameters:
o Learning rate α (typically between 0 and 1).
o Discount factor γ (typically close to 1).
o Exploration-exploitation trade-off strategy (e.g., ε-greedy policy).
3. Repeat for each episode:
o Start from an initial state s.
o Repeat until reaching a terminal state:
 Choose an action a for the current state s (e.g., using the ε-greedy policy).
 Take action a, observe the reward r and the next state s′.
 Update Q(s, a) ← Q(s, a) + α · [ r + γ·max_a′ Q(s′, a′) − Q(s, a) ].
 Set s ← s′.

4. End the training once convergence is reached (Q-values become stable).

This iterative process helps the agent learn optimal Q-values, which guide it to take actions that
maximize rewards.
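A minimal Q-learning sketch on a toy 1-D corridor environment (the environment, its rewards and all parameter values are assumptions for illustration):

import random

n_states, goal = 6, 5                 # states 0..5; state 5 is the goal
actions = [-1, +1]                    # move left / move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(n_states)]    # Q-table initialized to zeros

for episode in range(500):
    s = 0
    while s != goal:
        # ε-greedy action selection
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] >= Q[s][1] else 1
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 10.0 if s_next == goal else -1.0          # reward design (assumed)
        # Q-learning update: bootstrap on the best action in the next state
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print([[round(q, 2) for q in row] for row in Q])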


SARSA Learning
SARSA Algorithm (State-Action-Reward-State-Action)

Initialize Q-table:

o Create a table Q(s,a) for all state-action pairs.


o Initialize Q-values with random or zero values.

Set parameters:

o Learning rate α (typically between 0 and 1).


o Discount factor γ (typically close to 1).
o Exploration-exploitation strategy (e.g., ε-greedy policy).

Repeat for each episode:

o Start from an initial state s.


o Choose an action a using the ε-greedy policy.

Repeat until the terminal state is reached:

o Take action a, observe the reward r and the next state s′.
o Choose the next action a′ from s′ using the ε-greedy policy.
o Update Q(s, a) ← Q(s, a) + α · [ r + γ·Q(s′, a′) − Q(s, a) ].
o Set s ← s′ and a ← a′.

End the training when Q-values converge.

Differences between SARSA and Q-Learning
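For reference, the core distinction lies in the update targets (standard forms):

SARSA (on-policy): Q(s, a) ← Q(s, a) + α · [ r + γ·Q(s′, a′) − Q(s, a) ], where a′ is the action actually chosen next by the ε-greedy policy.

Q-Learning (off-policy): Q(s, a) ← Q(s, a) + α · [ r + γ·max_a′ Q(s′, a′) − Q(s, a) ], which bootstraps on the best next action regardless of the action the policy actually takes.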
