Module-5
Clustering is done using a trial-and-error approach, as there is no supervisor available as in classification.
The characteristic of clustering is that the objects within a cluster are similar to each other, while they differ significantly from the objects in other clusters.
The input for cluster analysis is a set of examples or samples. These are also known as objects or data points; all these terms mean the same thing and are used interchangeably in this chapter. All the samples or objects are unlabelled.
The output is the set of clusters (or groups) of similar data, if any exist in the input.
For example, Figure 13.1(a) shows data points or samples with two features, drawn as differently shaded samples, and Figure 13.1(b) shows manually drawn ellipses that enclose the groups of similar points.
Visual identification of clusters in this case is easy, as the examples have only two features.
But when examples have more features, say 100, clustering cannot be done manually, and automatic clustering algorithms are required.
Automating the clustering process is also desirable, as such tasks are difficult and often practically impossible for humans. All clusters are represented by centroids.
Example: If the input examples or data points are (3, 3), (2, 6) and (7, 9), then the centroid is given as ((3 + 2 + 7)/3, (3 + 6 + 9)/3) = (4, 6).
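A quick check of this centroid calculation (a minimal NumPy sketch):

    import numpy as np

    points = np.array([[3, 3], [2, 6], [7, 9]])
    centroid = points.mean(axis=0)   # column-wise mean of the samples
    print(centroid)                  # [4. 6.]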
The clusters should not overlap, and every cluster should represent only one class. Therefore, clustering algorithms use a trial-and-error method to form clusters that can then be converted into class labels.
Applications of Clustering
High-Dimensional Data
Scalability Issue
o Some algorithms perform well for small datasets but fail for large-scale data.
Unit Inconsistency
Proximity Measures
Variables
Binary Attributes
Categorical Variables
Ordinal Variables
Cosine Similarity
Distance Measures
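The headings above list common proximity measures. As a minimal numeric illustration (the sample vectors are assumptions), the sketch below computes Euclidean distance and cosine similarity with NumPy:

    import numpy as np

    x = np.array([3.0, 3.0])
    y = np.array([2.0, 6.0])

    # Euclidean distance: square root of the sum of squared differences
    euclidean = np.linalg.norm(x - y)

    # Cosine similarity: dot product divided by the product of vector norms
    cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    print(euclidean, cosine)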
Overview
o Merges clusters based on the smallest distance between two points from different clusters.
o Related to the Minimum Spanning Tree (MST).
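The bullets above describe the single-linkage merging rule used in agglomerative (hierarchical) clustering. A minimal sketch using SciPy's hierarchical clustering utilities (the toy data and the distance threshold are illustrative assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy 2-D samples (illustrative)
    X = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 1]])

    # 'single' linkage merges the two clusters whose closest points are nearest,
    # which is the MST-related rule described above
    Z = linkage(X, method='single')

    # Cut the dendrogram at an assumed distance threshold to obtain flat clusters
    labels = fcluster(Z, t=1.0, criterion='distance')
    print(labels)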
Mean-Shift Clustering
Advantages
No model assumptions are required.
Only one parameter, the window bandwidth, is required.
Robust to noise.
Disadvantages
Selecting the bandwidth is a challenging task. If it is too large, many clusters are missed; if it is too small, many points are missed and convergence becomes a problem.
The number of clusters cannot be specified, and the user has no control over this parameter.
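A minimal sketch of mean-shift using scikit-learn, where the bandwidth is the only parameter (the data and the bandwidth value are illustrative assumptions):

    import numpy as np
    from sklearn.cluster import MeanShift

    # Toy 2-D samples (illustrative)
    X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 0.8],
                  [5.0, 5.0], [5.2, 4.8], [5.1, 5.1]])

    # Bandwidth is the radius of the window used by the mean shift.
    # (scikit-learn can also estimate it with estimate_bandwidth.)
    ms = MeanShift(bandwidth=1.0)
    labels = ms.fit_predict(X)

    # The number of clusters is decided by the algorithm, not by the user
    print(labels)
    print(ms.cluster_centers_)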
1. Initialization
o Select k initial cluster centers (randomly or using prior knowledge).
o Normalize data for better performance.
2. Assignment of Data Points
o Assign each data point to the nearest centroid based on Euclidean distance.
3. Update Centroids
o Recompute each centroid as the mean of the data points assigned to it, and repeat the assignment and update steps until the centroids stop changing (a code sketch follows the complexity analysis below).
Mathematical Optimization
Advantages
Disadvantages
Computational Complexity
O(nkId), where:
o n = number of samples
o k = number of clusters
o I = number of iterations
o d = number of attributes
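As referenced in step 3 above, a minimal NumPy sketch of the k-means procedure described by these steps (the data, k, and iteration count are illustrative assumptions):

    import numpy as np

    def kmeans(X, k, iters=10, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Initialization: pick k samples as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # 2. Assignment: each point goes to the nearest centroid (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. Update: recompute each centroid as the mean of its assigned points
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = X[labels == j].mean(axis=0)
        return labels, centroids

    X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
    labels, centroids = kmeans(X, k=2)
    print(labels, centroids)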
Density-based Methods
1. Core Point
o A point with at least m points in its ε-neighborhood.
2. Border Point
o Has fewer than m points in its ε-neighborhood but is adjacent to a core point.
3. Noise Point (Outlier)
o Neither a core point nor a border point.
2. Densely Reachable
o X is densely reachable from Y if there exists a chain of core points linking them.
3. Density Connected
o X and Y are density connected if they are both densely reachable from a
common core point Z.
Advantages of DBSCAN
Disadvantages of DBSCAN
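A minimal sketch of DBSCAN using scikit-learn, where eps plays the role of the ε-neighborhood radius and min_samples corresponds to the minimum point count m described above (the data and parameter values are illustrative assumptions):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Toy 2-D samples: two dense groups plus one isolated point (illustrative)
    X = np.array([[1, 1], [1.1, 1.0], [0.9, 1.1],
                  [5, 5], [5.1, 4.9], [4.9, 5.1],
                  [9, 9]])

    # eps is the neighborhood radius; min_samples is the core-point threshold m
    db = DBSCAN(eps=0.5, min_samples=2)
    labels = db.fit_predict(X)

    # Label -1 marks noise points (outliers)
    print(labels)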
Grid-based Approach
Grid-based clustering partitions space into a grid structure and fits data into cells for
clustering.
Suitable for high-dimensional data.
Uses subspace clustering, dense cells, and monotonicity property.
Concepts
Subspace Clustering
o Clusters are formed using a subset of features (dimensions) rather than all
attributes.
o Useful for high-dimensional data like gene expression analysis.
Monotonicity Property
Advantages of CLIQUE
Disadvantage of CLIQUE
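CLIQUE itself is not available in common Python libraries, but the dense-cell idea behind grid-based clustering can be illustrated with a small sketch that bins points into grid cells and keeps the cells whose counts exceed a density threshold (the grid resolution, threshold, and data are illustrative assumptions):

    import numpy as np

    # Toy 2-D samples (illustrative)
    X = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
                  [0.8, 0.9], [0.85, 0.95],
                  [0.5, 0.1]])

    # Partition each dimension into a fixed number of intervals (the grid)
    bins = 4
    counts, edges = np.histogramdd(X, bins=bins, range=[(0, 1), (0, 1)])

    # A cell is "dense" if it holds at least `threshold` points
    threshold = 2
    dense_cells = np.argwhere(counts >= threshold)
    print(dense_cells)   # indices of the dense grid cells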
Chapter 2: Reinforcement Learning
RL simulates real-world scenarios for a computer program (agent) to learn by trial and
error.
The agent executes actions, receives positive or negative rewards, and optimizes its
future actions based on these experiences.
Characteristics of RL
o Consider a grid-based game where a robot must navigate from a starting node (E) to a goal
node (G) by choosing the optimal path.
o RL can learn the best route by exploring different paths and receiving feedback on their
efficiency.
o In obstacle-based games, RL can identify safe paths while avoiding dangerous zones.
1. Sequential Decision-Making
o In RL, decisions are made in steps, and each step influences future choices.
o Example: In a maze game, a wrong turn can lead to failure.
2. Delayed Feedback
o Rewards are not always immediate; the agent may need to take multiple steps before
receiving feedback.
3. Interdependence of Actions
o Each action affects the next set of choices, meaning an incorrect move can have
long-term consequences.
4. Time-Related Decisions
o Actions are taken in a specific sequence over time, affecting the final
outcome.
Reward Design
o Setting the right reward values is crucial. Incorrectly designed rewards may lead the agent
to learn undesired behavior.
Absence of an Environment Model
o Some environments, like chess, have fixed rules, but many real-world problems lack predefined models.
o Example: Training a self-driving car requires simulations to generate experiences.
Partial Observability
o Some environments, like weather prediction, involve uncertainty because complete state
information is unavailable.
1. Industrial Automation
o Optimizing robot movements for efficiency.
2. Resource Management
o Allocating resources in data centers and cloud computing.
3. Traffic Light Control
o Reducing congestion by adjusting signal timings dynamically.
4. Personalized Recommendation Systems
o Used in news feeds, e-commerce, and streaming services (e.g., Netflix
recommendations).
5. Online Advertisement Bidding
o Optimizing ad placements for maximum engagement.
6. Autonomous Vehicles
o RL helps in training self-driving cars to navigate safely.
7. Game AI (Chess, Go, Dota 2, etc.)
o AI models like AlphaGo use RL to master complex games.
8. DeepMind Applications
Reinforcement Learning (RL) is a distinct branch of machine learning that differs significantly
from supervised learning.
While supervised learning depends on labeled data, reinforcement learning learns through
interaction with the environment, making decisions based on trial and error.
Why Is RL Necessary?
Some tasks cannot be solved using supervised learning due to the absence of a labeled training
dataset. For example:
Chess & Go: There is no dataset with all possible game moves and their
outcomes. RL allows the agent to explore and improve over time.
Autonomous Driving: The car must learn from real-world experiences rather than
relying on a fixed dataset.
Basic Components of RL
Types of RL Problems
Learning Problems
Unknown environment – The agent learns and improves its policy by interacting with the environment through trial and error.
Planning Problems
Known environment – The agent can compute and improve the policy using a model.
Example – Chess AI that plans its moves based on game rules.
The environment contains all elements the agent interacts with, including
obstacles, rewards, and state transitions.
The agent makes decisions and performs actions to maximize rewards.
Example
In self-driving cars, the car is the agent, while the road, traffic signals, other vehicles, and pedestrians form the environment.
Example (Navigation)
In a grid-based game, states represent positions (A, B, C, etc.), and actions are movements (UP,
DOWN, LEFT, RIGHT).
Types of States
Types of Episodes
Episodic – Has a definite start and goal state (e.g., solving a maze).
Continuous – No fixed goal state; the task continues indefinitely (e.g., stock
trading).
Policies in RL
A policy (π) is the strategy used by the agent to choose actions.
Types of Policies
The optimal policy is the one that maximizes cumulative expected rewards.
Rewards in RL
RL Algorithm Categories
A Markov chain consists of a sequence of random variables in which the probability of transitioning to the next state depends only on the current state and not on the past states.
80% of students from University A move to University B for a master's degree, while
20% remain in University A.
60% of students from University B move to University A, while 40% remain in
University B.
Each row represents a probability distribution, meaning the sum of elements in each row equals
1.
Probability Prediction
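As an illustration of probability prediction with the university example above, the sketch below builds the transition matrix for the states [A, B] and computes the state distribution after a few steps (the number of steps is an illustrative assumption):

    import numpy as np

    # Transition matrix for states [A, B] from the example above:
    # from A: 20% stay in A, 80% move to B; from B: 60% move to A, 40% stay in B
    P = np.array([[0.2, 0.8],
                  [0.6, 0.4]])

    # Each row is a probability distribution, so every row sums to 1
    assert np.allclose(P.sum(axis=1), 1.0)

    # Starting in University A, the distribution after n steps is p0 @ P^n
    p0 = np.array([1.0, 0.0])
    p_after_2 = p0 @ np.linalg.matrix_power(P, 2)
    print(p_after_2)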
1. Set of states
2. Set of actions
3. Transition probability function
4. Reward function
5. Policy
6. Value function
Markov Assumption
The Markov property states that the probability of reaching a state and receiving a reward depends only on the immediately preceding state and action, not on the earlier history:
P(s_{t+1} = s', r_{t+1} = r | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s_{t+1} = s', r_{t+1} = r | s_t, a_t)
MDP Process
The probability of moving from state s to state s' after taking action a is given by:
P(s' | s, a) = P(s_{t+1} = s' | s_t = s, a_t = a)
This forms a state transition matrix, where each row represents transition probabilities from one
state to another.
Expected Reward
Goal of MDP
The agent's objective is to maximize total accumulated rewards over time by following an
optimal policy.
Learning Overview
Reinforcement learning (RL) uses trial and error to learn a series of actions that maximize the
total reward. RL consists of two fundamental sub-problems:
Prediction:
o The goal is to predict the total reward (return); this is also known as policy evaluation or value estimation.
o This requires the formulation of a function called the state-value function.
o The estimation of the state-value function can be performed using Temporal
Difference (TD) Learning.
Policy Improvement:
o The goal is to improve the policy using the estimated values, so that the agent selects better actions over time.
Consider a hypothetical casino with a robotic arm that activates a 5-armed slot machine. When a
lever is pulled, the machine returns a reward within a specified range (e.g., $1 to
$10).
The challenge is that each arm provides rewards randomly within this range.
Objective:
Given a limited number of attempts, the goal is to maximize the total reward by selecting the best
lever.
A logical approach is to determine which lever has the highest average reward and use it repeatedly.
Formalization:
Given k attempts on an N-arm slot machine, with rewards r_1, r_2, ..., r_k, the expected reward (action-value function) is:
Q(a) = (r_1 + r_2 + ... + r_k) / k
The action with the highest average reward is preferred, and this average is used as an indicator of action quality.
Example:
If a slot machine arm is chosen five times and returns rewards r_1, ..., r_5, the quality of this action is Q(a) = (r_1 + r_2 + r_3 + r_4 + r_5) / 5.
Exploration:
Exploitation:
Selection Policies
Greedy Method
ε-Greedy Method
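A minimal sketch contrasting the greedy and ε-greedy selection policies on the slot-machine example, with each arm's value estimated by its sample-average reward (the reward distributions, ε, and number of pulls are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n_arms = 5
    # True (hidden) mean reward of each arm, in the $1-$10 range (illustrative)
    true_means = rng.uniform(1, 10, size=n_arms)

    counts = np.zeros(n_arms)   # number of times each arm was pulled
    values = np.zeros(n_arms)   # sample-average reward estimate Q(a)
    epsilon = 0.1               # exploration probability

    for t in range(1000):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))        # explore: random arm
        else:
            arm = int(np.argmax(values))           # exploit: greedy arm
        reward = rng.normal(true_means[arm], 1.0)  # noisy reward from the arm
        counts[arm] += 1
        # Incremental update of the sample average
        values[arm] += (reward - values[arm]) / counts[arm]

    print(values, true_means)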
1. Value-Based Approaches
o Optimize the value function V(s), which represents the maximum expected future reward from a given state.
o Uses discount factors to prioritize future rewards.
2. Policy-Based Approaches
o Focus on finding the optimal policy π, a function that maps states to actions.
o Rather than estimating values, it directly learns which action to take.
3. Actor-Critic Methods (Hybrid Approaches)
o Combine value-based and policy-based methods.
o The actor updates the policy, while the critic evaluates it.
4. Model-Based Approaches
o Create a model of the environment (e.g., using Markov Decision Processes
(MDPs)).
o Use simulations to plan the best actions.
5. Model-Free Approaches
o Learn directly from experience and interaction with the environment, without building a model of it.
These categories of algorithms differ in:
Availability of models
Nature of updates (incremental vs. batch learning)
Exploration vs. exploitation trade-offs
Computational efficiency
Model-based Learning
Passive Learning refers to a model-based environment, where the environment is known. This
means that for any given state, the next state and action probability distribution are known.
Markov Decision Process (MDP) and Dynamic Programming are powerful tools for solving
reinforcement learning problems in this context.
The mathematical foundation for passive learning is provided by MDP. These model-based reinforcement learning problems can be solved using dynamic programming after constructing
the model with MDP.
The primary objective in reinforcement learning is to take an action a that transitions the system
from the current state to the end state while maximizing rewards. These rewards can be positive
or negative.
The goal is to maximize expected rewards by choosing the optimal policy π*, that is, the policy for which v_{π*}(s) ≥ v_π(s) for all states s and all policies π.
An agent in reinforcement learning has multiple courses of action for a given state. The way the
agent behaves is determined by its policy.
A policy is a distribution over all possible actions with probabilities assigned to each action.
Different actions yield different rewards. To quantify and compare these rewards, we use value
functions.
A value function summarizes possible future scenarios by averaging expected returns under a
given policy π.
It is a prediction of future rewards and computes the expected sum of future rewards for a given state s under policy π:
v_π(s) = E_π[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s]
where v(s) represents the quality of the state based on a long-term strategy.
Example: If we have two states with values 0.2 and 0.9, the state with value 0.9 is a better state to be in.
State-Value Function
Denoted as v(s), the state-value function of an MDP is the expected return from state s under a policy π:
v_π(s) = E_π[G_t | s_t = s]
This function accumulates all expected rewards, potentially discounted over time, and helps
determine the goodness of a state.
Apart from v(s), another function called the Q-function is used. This function returns a real
value indicating the total expected reward when an agent:
1. Starts in state s
2. Takes action a
3. Follows a policy π afterward
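In the same notation, the Q-function described above is commonly written as:
q_π(s, a) = E_π[G_t | s_t = s, a_t = a]
that is, the expected return when starting in state s, taking action a, and following policy π thereafter.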
Bellman Equation
Dynamic programming methods require a recursive formulation of the problem. The recursive formulation of the state-value function is given by the Bellman equation:
v_π(s) = Σ_a π(a | s) Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ v_π(s') ]
There are two main algorithms for solving reinforcement learning problems using conventional
methods:
1. Value Iteration
2. Policy Iteration
Value Iteration
Algorithm
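A minimal sketch of value iteration on a small toy MDP (the states, transition probabilities, rewards, and discount factor below are illustrative assumptions, not taken from the text):

    import numpy as np

    n_states, n_actions = 3, 2
    gamma = 0.9   # discount factor

    # P[a][s][s'] = probability of moving from s to s' under action a (illustrative)
    P = np.array([
        [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]],   # action 0
        [[0.2, 0.8, 0.0], [0.2, 0.0, 0.8], [0.0, 0.0, 1.0]],   # action 1
    ])
    # R[a][s] = expected immediate reward for taking action a in state s (illustrative)
    R = np.array([[0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])

    V = np.zeros(n_states)
    for _ in range(100):
        # Bellman optimality backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        Q = R + gamma * (P @ V)          # shape (n_actions, n_states)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < 1e-6:
            break
        V = V_new

    policy = Q.argmax(axis=0)   # greedy policy w.r.t. the final value function
    print(V, policy)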
Policy Iteration
1. Policy Evaluation
2. Policy Improvement
Policy Evaluation
Initially, for a given policy π, the algorithm starts with v(s) = 0 (no reward). The Bellman
equation is used to obtain v(s), and the process continues iteratively until the optimal v(s) is
found.
Policy Improvement
Algorithm
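Likewise, a compact sketch of policy iteration, alternating iterative policy evaluation with greedy policy improvement on a toy MDP of the same form (all numbers are illustrative assumptions):

    import numpy as np

    n_states, n_actions, gamma = 3, 2, 0.9
    P = np.array([
        [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]],
        [[0.2, 0.8, 0.0], [0.2, 0.0, 0.8], [0.0, 0.0, 1.0]],
    ])
    R = np.array([[0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])

    policy = np.zeros(n_states, dtype=int)   # start with an arbitrary policy
    for _ in range(50):
        # Policy evaluation: iterate the Bellman equation for the fixed policy
        V = np.zeros(n_states)
        for _ in range(200):
            V = np.array([R[policy[s], s] + gamma * P[policy[s], s] @ V
                          for s in range(n_states)])
        # Policy improvement: act greedily with respect to V
        Q = R + gamma * (P @ V)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break                             # policy is stable, so it is optimal
        policy = new_policy

    print(policy, V)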
Model-free methods do not require complete knowledge of the environment. Instead, they learn
through experience and interaction with the environment.
The reward determination in model-free methods can be categorized into three formulations:
1. Episodic Formulation: Rewards are assigned based on the outcome of an entire episode. For
example, if a game is won, all actions in the episode receive a positive reward (+1). If lost, all
actions receive a negative reward (-1). However, this approach may unfairly penalize or
reward intermediate actions.
2. Continuous Formulation: Rewards are determined immediately after an action. An
example is the multi-armed bandit problem, where an immediate reward between $1
- $10 can be given after each action.
3. Discounted Returns: Long-term rewards are considered using a discount factor. This method
is often used in reinforcement learning algorithms.
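A tiny sketch of the discounted-returns formulation, computing the return from a reward sequence (the reward values and γ are illustrative assumptions):

    # Discounted return: G = r1 + gamma*r2 + gamma^2*r3 + ...
    def discounted_return(rewards, gamma=0.9):
        g = 0.0
        # accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    # 1 + 0.9^3 * 5 = 4.645
    print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))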
Monte-Carlo Methods
Monte Carlo (MC) methods do not assume any predefined model, making them purely
experience-driven. This approach is analogous to how humans and animals learn from interactions
with their environment.
Experience is divided into episodes, where each episode is a sequence of states from a
starting state to a goal state.
Episodes must terminate; regardless of the starting point, an episode must reach an
endpoint.
Value-action functions are computed only after the completion of an episode, making
MC an incremental method.
MC methods compute rewards at the end of an episode to estimate maximum expected
future rewards.
Empirical mean is used instead of expected return; the total return over multiple
episodes is averaged.
Due to the non-stationary nature of environments, value functions are computed for a
fixed policy and revised using dynamic programming.
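A minimal sketch of first-visit Monte Carlo prediction: episodes are generated under a fixed policy, and each state's value is estimated as the empirical mean of the returns observed after its first visit (the random-walk environment below is an illustrative assumption):

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    gamma = 1.0
    n_states = 5            # states 0..4; reaching state 4 gives reward +1

    def generate_episode(start=2):
        # Fixed random-walk policy: move left or right with equal probability
        episode, s = [], start
        while s not in (0, n_states - 1):
            s_next = s + rng.choice([-1, 1])
            reward = 1.0 if s_next == n_states - 1 else 0.0
            episode.append((s, reward))
            s = s_next
        return episode

    returns = defaultdict(list)
    for _ in range(2000):
        episode = generate_episode()
        g, visited = 0.0, {}
        # work backwards to accumulate returns; the earliest (first) visit overwrites last
        for s, r in reversed(episode):
            g = r + gamma * g
            visited[s] = g
        for s, g_first in visited.items():
            returns[s].append(g_first)

    V = {s: np.mean(rs) for s, rs in returns.items()}
    print(V)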
Temporal Difference (TD) Learning is an alternative to Monte Carlo methods. It is also a model-
free technique that learns from experience and interaction with the environment.
Characteristics of TD Learning:
Bootstrapping Method: Updates are based on the current estimate and future reward.
Incremental Updates: Unlike MC, which waits until the end of an episode, TD
updates values at each step.
More Efficient: TD can learn before an episode ends, making it more sample-
efficient than MC methods.
Used for Non-Stationary Problems: Suitable for environments where conditions
change over time.
TD Learning can be accelerated using eligibility traces, which allow updates to be spread over multiple states. This leads to a family of algorithms called TD(λ), where λ is the decay parameter (0 ≤ λ ≤ 1): with λ = 0 the update reduces to one-step TD (TD(0)), while λ = 1 makes the updates behave like Monte Carlo estimates.
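A minimal sketch of one-step TD learning (TD(0)) for value prediction on a simple random-walk task, updating V(s) at every step rather than at the end of the episode (the environment, α, and γ are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n_states = 5                  # states 0..4; 0 and 4 are terminal, +1 for reaching 4
    alpha, gamma = 0.1, 1.0
    V = np.zeros(n_states)

    for _ in range(2000):
        s = 2                                        # start in the middle state
        while s not in (0, n_states - 1):
            s_next = s + rng.choice([-1, 1])
            r = 1.0 if s_next == n_states - 1 else 0.0
            # TD(0) update: bootstrap on the current estimate of the next state
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next

    print(V)   # V[1], V[2], V[3] approach 0.25, 0.5, 0.75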
Q-Learning
Q-Learning Algorithm
1. Initialize Q-table:
o Create a table Q(s,a) with states s and actions a.
o Initialize Q-values with random or zero values.
2. Set parameters:
o Learning rate α (typically between 0 and 1).
o Discount factor γ (typically close to 1).
o Exploration-exploitation trade off strategy (e.g., ε-greedy policy).
3. Repeat for each episode:
o Start from an initial state s.
o Repeat until reaching a terminal state: choose an action a (e.g., ε-greedy with respect to Q), take it, observe the reward r and next state s', update
Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)], and then set s ← s'.
This iterative process helps the agent learn optimal Q-values, which guide it to take actions that
maximize rewards.
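A minimal sketch of tabular Q-learning following the steps above, on a simple corridor task (the environment, parameter values, and episode count are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 6, 2           # actions: 0 = left, 1 = right; state 5 is the goal
    alpha, gamma, epsilon = 0.1, 0.9, 0.1

    Q = np.zeros((n_states, n_actions))  # 1. Initialize the Q-table with zeros

    def eps_greedy(s):
        # Explore with probability epsilon, otherwise act greedily
        # (ties between equal Q-values are broken at random)
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        best = np.flatnonzero(Q[s] == Q[s].max())
        return int(rng.choice(best))

    for _ in range(2000):                # 3. Repeat for each episode
        s = 0
        while s != n_states - 1:         # repeat until the terminal (goal) state
            a = eps_greedy(s)
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Q-learning update: bootstrap on the best action of the next state
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next

    print(Q.argmax(axis=1))   # greedy action per state (1 = right for non-terminal states)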
SARSA Learning
SARSA Algorithm (State-Action-Reward-State-Action)
Initialize Q-table:
Set parameters:
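For comparison, a minimal sketch of the SARSA update on the same kind of corridor task; SARSA is on-policy and bootstraps on the action actually chosen in the next state, rather than the maximum as in Q-learning (all environment details are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 6, 2            # actions: 0 = left, 1 = right; state 5 is the goal
    alpha, gamma, epsilon = 0.1, 0.9, 0.1
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        best = np.flatnonzero(Q[s] == Q[s].max())   # break ties randomly
        return int(rng.choice(best))

    for _ in range(2000):
        s = 0
        a = eps_greedy(s)                            # choose the first action
        while s != n_states - 1:
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            a_next = eps_greedy(s_next)              # the action actually taken in s_next
            # SARSA update: State, Action, Reward, next State, next Action
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next

    print(Q.argmax(axis=1))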