Unit 4
Memory Based Learning
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm
generally used for classification, though it can also be used for regression
tasks. It works by finding the "k" closest data points (neighbors) to a given
input and making a prediction based on the majority class (for classification)
or the average value (for regression). Since KNN makes no assumptions about the
underlying data distribution, it is a non-parametric, instance-based learning
method.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not
learn from the training set immediately; instead, it stores the dataset and
performs the computation only at classification time.
What is 'K' in K Nearest Neighbour?
In the k-Nearest Neighbours algorithm k is just a number that tells the algorithm
how many nearby points or neighbors to look at when it makes a decision.
Example: Imagine you're deciding which fruit a new item is based on its shape
and size. You compare it to fruits you already know.
If k = 3, the algorithm looks at the 3 closest fruits to the new one.
If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new
fruit is an apple because most of its neighbors are apples.
How to choose the value of k for KNN Algorithm?
The value of k in KNN decides how many neighbors the algorithm looks at
when making a prediction.
Choosing the right k is important for good results.
If the data has lots of noise or outliers, using a larger k can make the
predictions more stable.
But if k is too large, the model may become too simple and miss important
patterns; this is called underfitting.
So k should be picked carefully based on the data.
Statistical Methods for Selecting k
Cross-Validation: A good way to find the best value of k is k-fold
cross-validation. This means dividing the dataset into several parts (folds):
the model is trained on some of these parts and tested on the remaining
ones, and the process is repeated for each part. The k value that gives the
highest average accuracy during these tests is usually the best one to use,
as sketched below.
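Below is a minimal sketch of this selection procedure using scikit-learn; the
Iris dataset and the candidate range of k values (1 to 20) are illustrative
choices, not prescribed by the method.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
best_k, best_score = 1, 0.0
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: mean accuracy across the folds
    score = cross_val_score(knn, X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print("Best k:", best_k, "mean accuracy:", best_score)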
Elbow Method: In the Elbow Method we draw a graph showing the error rate or
accuracy for different k values. As k increases, the error usually drops at
first. But after a certain point, the error stops decreasing quickly. The point
where the curve changes direction and looks like an "elbow" is usually the best
choice for k.
Odd Values for k: It’s a good idea to use an odd number for k especially in
classification problems. This helps avoid ties when deciding which class is the
most common among the neighbors.
Distance Metrics Used in KNN Algorithm
KNN uses distance metrics to identify the nearest neighbors, and these
neighbors are then used for the classification or regression task. To identify
the nearest neighbors we use the distance metrics below:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a
plane or space. You can think of it like the shortest path you would walk if you
were to go directly from one point to another.
d(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}
2. Manhattan Distance
This is the total distance you would travel if you could only move along
horizontal and vertical lines like a grid or city streets. It’s also called "taxicab
distance" because a taxi can only drive along the grid-like streets of a city.
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean
and Manhattan distances as special cases.
d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
From the formula above, when p=2, it becomes the same as the Euclidean
distance formula and when p=1, it turns into the Manhattan distance formula.
Minkowski distance is essentially a flexible formula that can represent either
Euclidean or Manhattan distance depending on the value of p.
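As a quick check of these formulas, the short NumPy sketch below computes all
three distances for two example points (the points themselves are arbitrary)
and confirms that Minkowski distance reduces to Euclidean at p = 2 and
Manhattan at p = 1.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(x - y))          # grid ("taxicab") distance

def minkowski(x, y, p):
    # General form: p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(euclidean, minkowski(x, y, 2))  # same value
print(manhattan, minkowski(x, y, 1))  # same value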
Working of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity:
it predicts the label or value of a new data point by considering the labels
or values of its K nearest neighbors in the training dataset.
Step 1: Selecting the optimal value of K
K represents the number of nearest neighbors that need to be considered while
making a prediction.
Step 2: Calculating distance
To measure the similarity between the target point and the training data
points, Euclidean distance is typically used. The distance is calculated
between each data point in the dataset and the target point.
Step 3: Finding Nearest Neighbors
The k data points with the smallest distances to the target point are nearest
neighbors.
Step 4: Voting for Classification or Taking Average for Regression
When you want to classify a data point into a category like spam or not spam,
the KNN algorithm looks at the K closest points in the dataset. These closest
points are called neighbors. The algorithm then looks at which category the
neighbors belong to and picks the one that appears the most. This is called
majority voting.
In regression, the algorithm still looks for the K closest points. But instead
of voting for a class as in classification, it takes the average of the values
of those K neighbors. This average becomes the algorithm's predicted value for
the new point.
A typical illustration of this process shows how a test point is classified
based on its nearest neighbors: as the test point moves, the algorithm
identifies the closest k data points (5 in the illustration) and assigns the
test point the majority class label among those neighbors.
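The four steps can be condensed into a small from-scratch sketch; the toy data,
the value of k, and the function name below are illustrative assumptions.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Step 2: Euclidean distance from the query to every training point
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 2], [6, 6], [7, 7], [6, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 1]), k=3))  # prints 0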
Locally Weighted Regression
Locally weighted linear regression is a non-parametric regression method that
combines linear regression with k-nearest-neighbor-style, memory-based
learning. It is referred to as locally weighted because, for a query point, the
function is approximated on the basis of the data near that point, and weighted
because each data point's contribution is weighted by its distance from the
query point.
Locally Weighted Regression (LWR) is a non-parametric, memory-based algorithm,
which means it explicitly retains the training data and uses it every time a
prediction is made.
To explain the locally weighted linear regression, we first need to understand the
linear regression. The linear regression can be explained with the following
equations:
Let x be the query point and (x^{(i)}, y^{(i)}) the training observations.
Ordinary linear regression minimizes the cost function

J(\theta) = \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2

by calculating \theta so that it minimizes the above cost function.
Our output will be \theta^T x.
The closed-form formula for calculating \theta is:

\theta = (X^T X)^{-1} X^T Y

where X is the matrix of all observations and Y is the vector of all target
values.
For locally weighted linear regression, we instead minimize

J(\theta) = \sum_{i=1}^{m} w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2

by calculating \theta so that it minimizes this weighted cost function.
Our output will again be \theta^T x.
Here, w^{(i)} is the weight associated with each observation of the training
data. It can be calculated by the formula:

w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right)

Equivalently, the weights can be collected into a diagonal matrix W (with
W_{ii} = w^{(i)}) for a matrix calculation.
Impact of Bandwidth
Here x^{(i)} is an observation from the training data, x is the particular
point from which the distance is calculated, and \tau (tau) is the bandwidth.
\tau decides how local the fit is: if the function is to be closely fitted, its
value will be small. With the weight matrix W defined above, we can then
calculate \theta with the following equation:

\theta = (X^T W X)^{-1} X^T W Y
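A minimal NumPy sketch of this closed-form solution follows; the synthetic sine
data, the bias column, and the choice of \tau = 0.5 are illustrative
assumptions.
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    # Gaussian weights: w_i = exp(-||x_i - x||^2 / (2 tau^2))
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Closed-form weighted least squares: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.pinv(X.T @ W @ X) @ (X.T @ W @ y)
    return x_query @ theta

x = np.linspace(0, 3, 30)
X = np.c_[np.ones_like(x), x]  # bias column plus the raw feature
y = np.sin(x)
print(lwr_predict(X, y, np.array([1.0, 1.5])))  # close to sin(1.5)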
Radial Basis Functions
Radial Basis Function (RBF) Neural Networks are used for function
approximation tasks. They are a special category of feed-forward neural
networks comprising three layers. Due to this distinct three-layer architecture
and their universal approximation capabilities, they offer fast learning and
efficient performance in classification and regression problems.
How Do RBF Networks Work?
RBF Networks are conceptually similar to K-Nearest Neighbor (k-NN) models
though their implementation is distinct. The fundamental idea is that nearby items
with similar predictor variable values influence an item's predicted target value.
Here’s how RBF Networks operate:
1. Input Vector: The network receives an n-dimensional input vector that needs
classification or regression.
2. RBF Neurons: Each neuron in the hidden layer represents a prototype vector
from the training set. The network computes the Euclidean distance between
the input vector and each neuron's center.
3. Activation Function: The Euclidean distance is transformed using a Radial
   Basis Function (typically a Gaussian function) to compute the neuron's
   activation value. This value decreases exponentially as the distance
   increases (a minimal sketch of this step follows the list).
4. Output Nodes: Each output node calculates a score based on a weighted sum
of the activation values from all RBF neurons. For classification the category
with the highest score is chosen.
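To make step 3 concrete, here is a minimal sketch of a Gaussian RBF activation,
\varphi(x) = \exp(-\lVert x - c \rVert^2 / (2\sigma^2)); the center and spread
values are illustrative.
import numpy as np

def rbf_activation(x, center, sigma):
    # Activation is 1 at the center and decays exponentially with distance
    return np.exp(-np.sum((x - center) ** 2) / (2 * sigma ** 2))

c = np.array([0.0, 0.0])
print(rbf_activation(np.array([0.0, 0.0]), c, sigma=1.0))  # 1.0 at the center
print(rbf_activation(np.array([2.0, 0.0]), c, sigma=1.0))  # ~0.135 farther out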
For example, consider a dataset with two-dimensional data points from two
classes. An RBF Network trained with 20 neurons will have each neuron
representing a prototype in the input space. The network computes category
scores which can be visualized using 3-D mesh or contour plots. We assign
positive weights to neurons in the same category and negative weights to neurons
in different categories. The decision boundary can be plotted by evaluating scores
over a grid.
Key Characteristics of RBFs
Radial Basis Functions: These are real-valued functions dependent solely on
the distance from a central point. The Gaussian function is the most commonly
used type.
Dimensionality: The network's dimensions correspond to the number of
predictor variables.
Center and Radius: Each RBF neuron has a center and a radius (spread). The
radius affects how broadly each neuron influences the input space.
Architecture of RBF Networks
The architecture of an RBF Network typically consists of three layers:
Input Layer
Function: After receiving the input features the input layer sends them straight
to the hidden layer.
Components: It is made up of the same number of neurons as there are features
in the input data; each neuron in the input layer corresponds to one feature
of the input vector.
Hidden Layer
Function: This layer uses radial basis functions (RBFs) to conduct the non-
linear transformation of the input data.
Components: Neurons in the hidden layer apply the RBF to the incoming data.
The Gaussian function is the RBF that is most frequently used.
RBF Neurons: Every neuron in the hidden layer has a spread parameter (σ)
and a center, which are also referred to as prototype vectors. The spread
parameter modulates how the distance between the input vector and the neuron's
center translates into the neuron's output.
Output Layer
Function: The output layer uses weighted sums to integrate the hidden layer
neurons outputs to create the network's final output.
Components: It is made up of neurons that combine the outputs of the hidden
layer in a linear fashion. To reduce the error between the network's predictions
and the actual target values, the weights of these combinations are changed
during training.
Implementing Radial Basis Function Neural Network
An RBF neural network is trained in three stages: choosing the centers,
determining the spread parameters, and training the output weights.
Step 1: Selecting the Centers
Techniques for Center Selection: Centers can be picked at random from the
training data or by applying techniques such as k-means clustering.
K-Means Clustering: In this widely used center-selection technique, the input
data is grouped into k clusters, and the centers of these clusters are
employed as the centers for the RBF neurons.
Step 2: Determining the Spread Parameters
The spread parameter (σ) governs each RBF neuron's area of effect and
establishes the width of the RBF.
Calculation: The spread parameter can be manually adjusted for each neuron
or set as a constant for all neurons. A popular method is to set σ based on
the separation between the centers, frequently accomplished with a heuristic
such as dividing the greatest distance between centers by the square root of
twice the number of centers: σ = d_max / √(2K).
Step 3: Training the Output Weights
Linear Regression: Linear regression techniques are commonly used to estimate
the output-layer weights; the objective is to minimize the error between the
predicted output and the actual target values.
Pseudo-Inverse Method: One popular technique for determining the weights is to
use the pseudo-inverse of the matrix of hidden-layer outputs.
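The three stages can be sketched end to end as follows; scikit-learn's KMeans
is one way to implement Step 1, and the number of centers, dataset, and one-hot
output encoding here are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
K = 10  # number of RBF neurons

# Step 1: pick centers with k-means clustering
centers = KMeans(n_clusters=K, n_init=10, random_state=42).fit(X).cluster_centers_

# Step 2: one shared spread, sigma = d_max / sqrt(2K)
d_max = max(np.linalg.norm(a - b) for a in centers for b in centers)
sigma = d_max / np.sqrt(2 * K)

def hidden_layer(X):
    # Gaussian RBF activation for every (sample, center) pair
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

# Step 3: output weights via the pseudo-inverse, with one-hot targets
T = np.eye(3)[y]
W = np.linalg.pinv(hidden_layer(X)) @ T

pred = np.argmax(hidden_layer(X) @ W, axis=1)
print("Training accuracy:", (pred == y).mean())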
Case Based Learning in Machine Learning
Case-Based Learning (CBL) in Machine Learning (ML) is a method where a
model solves new problems by comparing them with previously encountered cases
(examples). It is a memory-based learning approach that doesn't explicitly learn
a model during training but instead stores instances (cases) and defers
generalization until a new query is presented.
🔍 What is Case-Based Learning?
Case-Based Learning (CBL) is inspired by human reasoning: when we encounter
a problem, we think of similar past experiences (cases) and reuse the knowledge
from them to make decisions.
Cases = stored experiences or examples, typically represented as feature
vectors with corresponding labels.
When a new problem (query) is encountered, the system searches for
similar past cases.
Decision is made based on the most similar case(s).
🔍 Key Components of a Case-Based Learning System
Case Base: A memory of past cases (instances, examples)
Similarity Measure: A method to compute how similar a new problem is to stored
cases (e.g., Euclidean distance)
Retrieval Mechanism: Algorithm to find the most relevant past cases
Adaptation: Adjust the solution from retrieved case(s) to fit the new problem
Learning: Update the case base by adding new cases (and possibly removing old
ones)
🔍 Examples of Case-Based Learning Algorithms
1. K-Nearest Neighbors (KNN) – A classic case-based learner.
o Stores all training data.
o When a query comes, finds the k most similar cases and makes a
prediction (e.g., by majority vote for classification).
2. Locally Weighted Regression (LWR) – Predicts based on locally relevant
training data.
o Assigns weights to training instances based on proximity to the query.
3. Case-Based Reasoning (CBR) – Used in expert systems and AI, involving:
o Retrieve → Reuse → Revise → Retain cycle.
🔍 Case-Based Reasoning (CBR) Cycle
1. Retrieve most similar case(s)
2. Reuse the case to solve the problem
3. Revise the proposed solution if necessary
4. Retain the new solution as part of the case base
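A toy sketch of this cycle follows; the case representation, the distance-based
similarity measure, and the optional revise hook are illustrative assumptions.
import numpy as np

case_base = [  # each case: (feature vector, stored solution)
    (np.array([1.0, 1.0]), "solution A"),
    (np.array([5.0, 5.0]), "solution B"),
]

def solve(query, revise=None):
    # Retrieve: the most similar stored case (Euclidean distance)
    features, solution = min(case_base, key=lambda c: np.linalg.norm(c[0] - query))
    # Reuse the retrieved solution; Revise it if a correction is supplied
    if revise is not None:
        solution = revise(solution)
    # Retain: store the solved query as a new case
    case_base.append((query, solution))
    return solution

print(solve(np.array([1.2, 0.8])))  # retrieves and reuses "solution A"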
🔍 Advantages of Case-Based Learning
No need for an explicit training phase.
Naturally supports incremental learning.
Good for domains with episodic memory, like helpdesk systems, diagnosis,
etc.
Highly interpretable (reasoning is traceable through past cases).
🔍 Disadvantages
Slow during inference (especially for large datasets).
Memory-intensive, since it stores all (or most) of the training data.
PAC learning model
Probably Approximately Correct (PAC) learning stands as a cornerstone theory,
offering insights into the fundamental question of how much data is needed for
learning algorithms to reliably generalize to unseen instances. PAC learning
provides a theoretical framework that underpins many machine learning
algorithms.
PAC Learning Theorem
The PAC learning theorem provides formal guarantees about the performance of
learning algorithms. It states that for a given accuracy (ε) and confidence (δ),
there exists a sample size (m) such that any learning algorithm that returns a
hypothesis consistent with the training samples will, with probability at least 1-δ,
have an error rate less than ε on unseen data.
Mathematically, the PAC learning theorem can be expressed as a
sample-complexity bound; a standard form of this bound is:

m \geq \frac{1}{\epsilon} \left( 4 \log_2 \frac{2}{\delta} + 8\, VC(H) \log_2 \frac{13}{\epsilon} \right)

where:
m is the number of samples,
ϵ is the desired accuracy,
δ is the desired confidence level,
VC(H) is the Vapnik-Chervonenkis dimension of the hypothesis space H.
The VC dimension is a measure of the capacity or complexity of the hypothesis
space. It quantifies the maximum number of points that can be shattered (i.e.,
correctly classified in all possible ways) by the hypotheses in the space. A higher
VC dimension indicates a more complex hypothesis space, which may require
more samples to ensure good generalization.
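A worked computation of this bound makes the sample-size guidance concrete;
the values of ε, δ, and VC(H) below are illustrative.
import math

def pac_sample_size(epsilon, delta, vc_dim):
    # m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))
    return math.ceil((1 / epsilon) * (4 * math.log2(2 / delta)
                                      + 8 * vc_dim * math.log2(13 / epsilon)))

# 90% accuracy (eps = 0.1), 95% confidence (delta = 0.05), VC(H) = 3
print(pac_sample_size(0.1, 0.05, 3))  # on the order of a couple thousand samples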
The PAC learning theorem provides a powerful tool for analyzing and designing
learning algorithms. It helps determine the sample size needed to achieve a
desired level of accuracy and confidence, guiding the development of efficient
and effective models.
Challenges of PAC Learning
Real-world Applicability
While PAC learning provides a solid theoretical foundation, applying it to real-
world problems can be challenging. The assumptions made in PAC learning, such
as the availability of a finite hypothesis space and the existence of a true
underlying function, may not always hold in practice.
In real-world scenarios, data distributions can be complex and unknown, and the
hypothesis space may be infinite or unbounded. These factors can complicate the
application of PAC learning, requiring additional techniques and considerations
to achieve practical results.
Computational Complexity
Finding the optimal hypothesis within the PAC framework can be
computationally expensive, especially for large and complex hypothesis spaces.
This can limit the practical use of PAC learning for certain applications,
particularly those involving high-dimensional data or complex models.
Efficient algorithms and optimization techniques are needed to make PAC
learning feasible for practical use. Researchers are continually developing new
methods to address the computational challenges of PAC learning and improve its
applicability to real-world problems.
Ensemble Learning
Ensemble learning is a method where we use many small models instead of just
one. Each of these models may not be very strong on its own, but when we put
their results together, we get a better and more accurate answer. It's like asking a
group of people for advice instead of just one person—each one might be a little
wrong, but together, they usually give a better answer.
Types of Ensembles Learning in Machine Learning
There are three main types of ensemble methods:
1. Bagging (Bootstrap Aggregating):
Models are trained independently on different random subsets of the training
data. Their results are then combined—usually by averaging (for regression) or
voting (for classification). This helps reduce variance and prevents overfitting.
2. Boosting:
Models are trained one after another. Each new model focuses on fixing the
errors made by the previous ones. The final prediction is a weighted
combination of all models, which helps reduce bias and improve accuracy.
3. Stacking (Stacked Generalization):
Multiple different models (often of different types) are trained, and their
predictions are used as inputs to a final model, called a meta-model. The meta-
model learns how to best combine the predictions of the base models, aiming
for better performance than any individual model (a short sketch follows this
list).
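Since only bagging and boosting are coded later in this unit, here is a minimal
scikit-learn sketch of stacking; the particular base models and meta-model
chosen are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

stacking_classifier = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),  # the meta-model
)
stacking_classifier.fit(X, y)
print("Training accuracy:", stacking_classifier.score(X, y))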
1. Bagging Algorithm
Bagging can be used for both regression and classification tasks. Here is an
overview of the Bagging algorithm:
Bootstrap Sampling: Creates 'N' subsets of the original training data by
randomly sampling rows with replacement, so a subset may contain duplicate
rows and omit others. This step ensures that the base models are trained on
diverse subsets of the data.
Base Model Training: For each bootstrapped sample we train a base model
independently on that subset of data. These weak models are trained in parallel
to increase computational efficiency and reduce time consumption. We can use
different base learners i.e. different ML models as base learners to bring
variety and robustness.
Prediction Aggregation: To make a prediction on testing data, the predictions
of all base models are combined. For classification tasks this can be majority
voting or weighted majority voting, while for regression it involves averaging
the predictions.
Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training
subset of particular base models during the bootstrapping method. These “out-
of-bag” samples can be used to estimate the model’s performance without the
need for cross-validation.
Final Prediction: After aggregating the predictions from all the base models,
Bagging produces a final prediction for each instance.
Python code for implementing a Bagging estimator with scikit-learn:
1. Importing Libraries and Loading Data
BaggingClassifier: for creating an ensemble of classifiers trained on different
subsets of data.
DecisionTreeClassifier: the base classifier used in the bagging ensemble.
load_iris: to load the Iris dataset for classification.
train_test_split: to split the dataset into training and testing subsets.
accuracy_score: to evaluate the model’s prediction accuracy.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
2. Loading and Splitting the Iris Dataset
data = load_iris(): loads the Iris dataset, which includes features and target
labels.
X = data.data: extracts the feature matrix (input variables).
y = data.target: extracts the target vector (class labels).
train_test_split(...): splits the data into training (80%) and testing (20%) sets,
with random_state=42 to ensure reproducibility.
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
3. Creating a Base Classifier
A decision tree is chosen as the base model. Decision trees are prone to
overfitting when trained on small datasets, which makes them good candidates
for bagging.
base_classifier = DecisionTreeClassifier(): initializes a Decision Tree
classifier, which will serve as the base estimator in the Bagging ensemble.
base_classifier = DecisionTreeClassifier()
4. Creating and Training the Bagging Classifier
A BaggingClassifier is created using the decision tree as the base classifier.
n_estimators = 10 specifies that 10 decision trees will be trained on different
bootstrapped subsets of the training data.
bagging_classifier = BaggingClassifier(base_classifier, n_estimators=10,
random_state=42)
bagging_classifier.fit(X_train, y_train)
5. Making Predictions and Evaluating Accuracy
The trained bagging model predicts labels for test data.
The accuracy of the predictions is calculated by comparing the predicted labels
(y_pred) to the actual labels (y_test).
y_pred = bagging_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 1.0
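As a follow-up to the OOB evaluation point above, BaggingClassifier can score
itself on the out-of-bag samples via its oob_score option; this sketch reuses
base_classifier and the training split from the example.
# More estimators, so every sample is out-of-bag for at least one tree
bagging_oob = BaggingClassifier(base_classifier, n_estimators=50,
                                oob_score=True, random_state=42)
bagging_oob.fit(X_train, y_train)
print("OOB score:", bagging_oob.oob_score_)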
2. Boosting Algorithm
Boosting is an ensemble technique that combines multiple weak learners to create
a strong learner. Weak models are trained in series such that each next model
tries to correct errors of the previous model until the entire training dataset is
predicted correctly. One of the most well-known boosting algorithms is AdaBoost
(Adaptive Boosting). Here is an overview of Boosting algorithm:
Initialize Model Weights: Begin with a single weak learner and assign equal
weights to all training examples.
Train Weak Learner: Train a weak learner on this dataset.
Sequential Learning: Boosting works by training models sequentially where
each model focuses on correcting the errors of its predecessor. Boosting
typically uses a single type of weak learner like decision trees.
Weight Adjustment: Boosting assigns weights to training datapoints.
Misclassified examples receive higher weights in the next iteration so that next
models pay more attention to them.
Python code for implementing a Boosting estimator with scikit-learn:
1. Importing Libraries and Modules
AdaBoostClassifier from sklearn.ensemble: for building the AdaBoost
ensemble model.
DecisionTreeClassifier from sklearn.tree: as the base weak learner for
AdaBoost.
load_iris from sklearn.datasets: to load the Iris dataset.
train_test_split from sklearn.model_selection: to split the dataset into
training and testing sets.
accuracy_score from sklearn.metrics: to evaluate the model’s accuracy.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
2. Loading and Splitting the Dataset
data = load_iris(): loads the Iris dataset, which includes features and
target labels.
X = data.data: extracts the feature matrix (input variables).
y = data.target: extracts the target vector (class labels).
train_test_split(...): splits the data into training (80%) and testing (20%)
sets, with random_state=42 to ensure reproducibility.
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
3. Defining the Weak Learner
We are creating the base classifier as a decision tree with maximum depth 1 (a
decision stump). This simple tree will act as a weak learner for the AdaBoost
algorithm, which iteratively improves by combining many such weak learners.
base_classifier = DecisionTreeClassifier(max_depth=1)
4. Creating and Training the AdaBoost Classifier
base_classifier: The weak learner used in boosting.
n_estimators = 50: Number of weak learners to train sequentially.
learning_rate = 1.0: Controls the contribution of each weak learner to the
final model.
random_state = 42: Ensures reproducibility.
adaboost_classifier = AdaBoostClassifier(
base_classifier, n_estimators=50, learning_rate=1.0, random_state=42
)
adaboost_classifier.fit(X_train, y_train)
5. Making Predictions and Calculating Accuracy
The trained model first predicts labels for the test data. We then calculate
the accuracy by comparing the true labels y_test with the predicted
labels y_pred; the accuracy_score function returns the proportion of correctly
predicted samples. Finally, we print the accuracy value.
y_pred = adaboost_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 1.0
Benefits of Ensemble Learning in Machine Learning
Ensemble learning is a versatile approach that can be applied to machine
learning models for:
Reduction in Overfitting: By aggregating the predictions of multiple models,
ensembles can reduce the overfitting that individual complex models might
exhibit.
Improved Generalization: It generalizes better to unseen data by minimizing
variance and bias.
Increased Accuracy: Combining multiple models gives higher predictive
accuracy.
Robustness to Noise: It mitigates the effect of noisy or incorrect data points
by averaging out predictions from diverse models.
Flexibility: It can work with diverse models including decision trees, neural
networks and support vector machines making them highly adaptable.
Bias-Variance Tradeoff: Techniques like bagging reduce variance, while
boosting reduces bias leading to better overall performance.
There are various ensemble learning techniques we can use, each with its own
pros and cons.
Ensemble Learning Techniques
Technique | Category | Description
Random Forest | Bagging | Constructs multiple decision trees on bootstrapped
subsets of the data and aggregates their predictions for the final output,
reducing overfitting and variance.
Random Subspace Method | Bagging | Trains models on random subsets of the
input features to enhance diversity and improve generalization while reducing
overfitting.
Gradient Boosting Machines (GBM) | Boosting | Sequentially builds decision
trees, with each tree correcting the errors of the previous ones, enhancing
predictive accuracy iteratively.
Extreme Gradient Boosting (XGBoost) | Boosting | Adds optimizations like tree
pruning, regularization, and parallel processing for robust and efficient
predictive models.
AdaBoost (Adaptive Boosting) | Boosting | Focuses on challenging examples by
assigning weights to data points and combines weak classifiers with weighted
voting for final predictions.
CatBoost | Boosting | Specializes in handling categorical features natively
without extensive preprocessing, offering high predictive accuracy and
automatic overfitting handling.