
Unit-3
Decision Tree algorithms:
Decision tree algorithms are a type of supervised machine learning method used for classification
and regression tasks. They work by recursively splitting a dataset into subsets based on the value of
input features, ultimately leading to a tree-like structure where each internal node represents a
"test" or decision on an attribute, each branch represents the outcome of that test, and each leaf
node represents a class label (in classification) or a continuous value (in regression).
Key Concepts of Decision Trees:
1. Root Node: The topmost node in a decision tree. It represents the entire dataset and is split into two
or more homogeneous sets.
2. Splitting: The process of dividing a node into two or more sub-nodes based on a feature. The choice
of feature and the specific value for the split are determined by a measure of "impurity" or "error".
3. Decision Node: A node that has two or more child nodes. It's where the data is further split based on
a feature.
4. Leaf Node (Terminal Node): A node that does not split further. It represents the final output or
prediction (class label or regression value).
5. Pruning: The process of removing nodes to prevent the model from overfitting. It simplifies the tree
by trimming branches that have little importance.
Example: Suppose a candidate has a job offer and wants to decide whether to accept it or not. To
solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by an
attribute selection measure). The root node splits further into the next decision node (distance from
the office) and one leaf node based on the corresponding labels. The next decision node further splits
into one decision node (cab facility) and one leaf node. Finally, that decision node splits into two leaf
nodes (Accepted offer and Declined offer). Consider the below diagram:

Attribute Selection Measures


While implementing a decision tree, the main issue is how to select the best attribute for the root
node and for the sub-nodes. To solve this problem, there is a technique called the attribute selection
measure (ASM). Using this measure, we can easily select the best attribute for the nodes of the tree.
There are two popular techniques for ASM, which are:
• Information Gain
• Gini Index
1. Information Gain:
• Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the decision tree.
• A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:
Information Gain = Entropy(S) - [Weighted Avg * Entropy(each feature)]
Entropy:
Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data.

Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
• S= Total number of samples
• P(yes)= probability of yes
• P(no)= probability of no
2. Gini Index:
• Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred over an attribute with a high Gini index.
• It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
• Gini index can be calculated using the below formula:
Gini Index = 1 - Σ_j P_j²
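As a small illustration of these formulas (not part of the original notes; the split and class counts below are made up), the following Python sketch computes entropy, information gain, and the Gini index for a toy binary-labelled split:

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum(p * log2(p)) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini = 1 - sum(p^2) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def information_gain(parent_labels, child_label_groups):
    """IG = Entropy(parent) - weighted average entropy of the child nodes."""
    n = len(parent_labels)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy

# Toy example: yes/no labels before and after splitting on some attribute.
parent = np.array(["yes"] * 9 + ["no"] * 5)
left   = np.array(["yes"] * 6 + ["no"] * 1)   # attribute value = A
right  = np.array(["yes"] * 3 + ["no"] * 4)   # attribute value = B

print("Entropy(parent):", round(entropy(parent), 3))
print("Gini(parent):   ", round(gini(parent), 3))
print("Information gain of the split:", round(information_gain(parent, [left, right]), 3))
```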
Pruning: Getting an Optimal Decision tree
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the important
features of the dataset. Therefore, a technique that decreases the size of the learning tree without
reducing accuracy is known as pruning. There are two main types of pruning techniques used:
• Cost Complexity Pruning
• Reduced Error Pruning.

6. Impurity Measures:
o Gini Impurity: Used in the Gini Index, it measures how often a randomly chosen element
would be incorrectly classified.
o Entropy: Used in Information Gain, it measures the randomness in the information being
processed.
o Variance Reduction: Used in regression trees, it measures how much the variance is reduced
in the target variable after the split.
Popular Decision Tree Algorithms:
ID3 (Iterative Dichotomiser 3): Uses entropy and information gain to build a tree. It selects the
attribute that maximizes information gain for splitting.
C4.5: An extension of ID3 that handles both categorical and continuous data. It uses Gain Ratio as
the splitting criterion and includes mechanisms for pruning.
CART (Classification and Regression Trees): Can be used for both classification and regression. For
classification, it uses the Gini impurity as the splitting criterion. For regression, it uses variance
reduction. CART produces binary trees (every node has two children).
CHAID (Chi-square Automatic Interaction Detector): A more statistical approach that uses chi-
square tests to determine the best splits. It's used mainly for categorical variables and can create
non-binary trees.
Random Forest: An ensemble method that builds multiple decision trees and merges them to get a
more accurate and stable prediction. It reduces the variance of predictions, making the model less
likely to overfit.
Gradient Boosted Trees (GBTs): Another ensemble technique that builds trees sequentially, where
each new tree corrects the errors made by the previous ones. It's highly effective for both regression
and classification tasks.
Advantages of Decision Trees:
• Simple to Understand and Interpret: Trees can be visualized, and their logic is easy to follow.
• No Need for Feature Scaling: Unlike other algorithms, decision trees don't require
normalization or scaling of data.
• Handles Both Categorical and Continuous Data: Decision trees can work with different types
of data.
• Non-parametric: No assumptions about the distribution of data.
Disadvantages of Decision Trees:
• Overfitting: Trees can become overly complex, capturing noise in the data.
• Unstable: Small variations in the data can lead to a completely different tree structure.
• Bias towards features with more levels: Features with more possible split points can
dominate the decision tree.
Applications:
• Medical Diagnosis: Classifying diseases based on symptoms.

• Credit Scoring: Assessing the risk of lending to individuals.
• Customer Segmentation: Grouping customers based on behavior for targeted marketing.
Decision tree algorithms provide a solid foundation for more advanced ensemble methods and are a
fundamental tool in machine learning.
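As a minimal usage sketch (assuming scikit-learn is available; the Iris dataset and the parameter values are illustrative choices, not prescribed by these notes), a decision tree can be trained and inspected like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion="entropy" uses information gain; criterion="gini" (the default) uses the Gini index.
# max_depth and min_samples_leaf act as pre-pruning controls against overfitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=5, random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=["sepal length", "sepal width", "petal length", "petal width"]))
```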
Random Forest algorithm:
The Random Forest algorithm is a popular ensemble learning method used for both classification
and regression tasks. It operates by constructing multiple decision trees during training and
outputting the class that is the mode of the classes (for classification) or the mean prediction (for
regression) of the individual trees. Random Forest is particularly well-known for its accuracy,
robustness, and ability to handle large datasets with higher dimensionality.
Key Concepts of Random Forest:
Ensemble Learning: Random Forest is an ensemble method, meaning it builds a "forest" of decision
trees, each trained on a different subset of data and features. The final prediction is based on the
aggregated results of these trees, making the model more robust than a single decision tree.
Bootstrap Aggregating (Bagging):
o Bootstrap Sampling: Each tree in the forest is trained on a different sample of data. These
samples are drawn with replacement from the original dataset, meaning some data points
may appear multiple times in a sample, while others may not appear at all.
o Aggregation: For classification tasks, Random Forest takes the majority vote of all the trees in
the forest to make a final prediction. For regression tasks, it averages the outputs of the trees.
Random Subset of Features: During the training of each tree, Random Forest randomly selects a
subset of features to consider for splitting at each node. This randomization helps reduce the
correlation between trees, improving the overall model's generalization ability.
Out-of-Bag (OOB) Error: Since each tree is trained on a different bootstrap sample, some data points
are left out of that sample (out-of-bag). These out-of-bag data points can be used to estimate the
error of the model without needing a separate validation set, providing a built-in method for model
evaluation.
Advantages of Random Forest:
• High Accuracy: Random Forest often provides better accuracy than a single decision tree due
to the reduction of overfitting and variance.
• Robustness to Noise: By averaging multiple trees, Random Forest reduces the impact of noise
and outliers.
• Handles High Dimensionality: The algorithm performs well even with large numbers of
features.
• Feature Importance: Random Forest can provide estimates of the importance of each feature
in the prediction process, which can be useful for feature selection.
• Scalability: It can be parallelized easily, making it efficient for large datasets.
Disadvantages of Random Forest:
• Complexity: The model is less interpretable compared to a single decision tree, as it consists of
multiple trees.
• Slower Predictions: Due to the large number of trees, making predictions can be slower
compared to simpler models.
• Memory Usage: Random Forest can be memory-intensive, especially with large datasets and a
large number of trees.
Hyperparameters in Random Forest:
Number of Trees (n_estimators): The number of trees in the forest. More trees generally improve
performance but increase computation time.
Maximum Depth (max_depth): Limits the depth of each tree. Shallower trees reduce the risk of
overfitting but may underfit.
Minimum Samples Split (min_samples_split): The minimum number of samples required to split an
internal node. Higher values prevent the model from learning overly specific patterns (overfitting).
Minimum Samples Leaf (min_samples_leaf): The minimum number of samples required to be at a
leaf node. Larger values lead to smoother decision boundaries.
Maximum Features (max_features): The number of features to consider when looking for the best
split. A lower value reduces overfitting but may lead to underfitting.
Bootstrap: A boolean parameter that determines whether bootstrap samples are used when
building trees. If False, the whole dataset is used to build each tree.
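A hedged sketch of how these hyperparameters might be set with scikit-learn's RandomForestClassifier (the dataset and the specific values are illustrative only); it also shows the out-of-bag score and feature importances mentioned above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameters named in the notes: n_estimators, max_depth, min_samples_split,
# min_samples_leaf, max_features, bootstrap. oob_score=True uses the out-of-bag
# samples as a built-in validation estimate.
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    bootstrap=True,
    oob_score=True,
    n_jobs=-1,          # trees can be trained in parallel
    random_state=42,
)
rf.fit(X, y)

print("Out-of-bag accuracy estimate:", round(rf.oob_score_, 3))

# Feature importance can guide feature selection.
top = sorted(zip(rf.feature_importances_, load_breast_cancer().feature_names), reverse=True)[:5]
for importance, name in top:
    print(f"{name}: {importance:.3f}")
```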
Applications of Random Forest:
• Classification: Spam detection, fraud detection, image classification.
• Regression: Predicting house prices, stock market predictions.

• Feature Selection: Identifying the most important features in datasets.
• Anomaly Detection: Detecting rare events in large datasets.
Example Workflow with Random Forest:
Data Preparation: Clean and preprocess the dataset.
Model Training: Set the Random Forest hyperparameters and train the model using the training
dataset.
Model Evaluation: Use out-of-bag error or a separate validation set to evaluate the model's
performance.
Hyperparameter Tuning: Adjust parameters like the number of trees, maximum depth, and
minimum samples per leaf to optimize performance.
Feature Importance: Analyze the importance of features to interpret the model and possibly reduce
the dimensionality of the data.
Random Forest is widely used due to its versatility, reliability, and the fact that it often outperforms
simpler models, especially when dealing with complex datasets.
Example:
Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree. During
the training phase, each decision tree produces a prediction result, and when a new data point
occurs, then based on the majority of results, the Random Forest classifier predicts the final
decision. Consider the below image:

From <https://www.javatpoint.com/machine-learning-random-forest-algorithm>

Bagging:
Bagging, short for Bootstrap Aggregating, is an ensemble learning technique designed to improve
the performance and robustness of machine learning models. It works by combining the predictions
of multiple models to make a final prediction, with the primary aim of reducing variance and
preventing overfitting.
Key Concepts:
Bootstrap Sampling:
o Bagging involves creating multiple subsets of the training data through a process called
bootstrap sampling. In this process, each subset is generated by randomly sampling the
original dataset with replacement.
o Each of these subsets (bootstrap samples) has the same size as the original dataset but may
contain duplicate instances.
Training Multiple Models:
o A separate model is trained on each bootstrap sample. Since the data samples are different for
each model, each model will learn slightly different patterns.
Aggregating Predictions:
o For regression tasks, the final prediction is typically the average of the predictions from all
individual models.
o For classification tasks, the final prediction is usually determined by majority voting (i.e., the
class that receives the most votes from the individual models is chosen).

Reducing Variance:
o Bagging primarily aims to reduce the variance of the model. By combining predictions from
multiple models trained on different subsets of the data, bagging reduces the risk of
overfitting and improves the model's generalization ability.
o It does not significantly affect the bias of the model but can make the model more robust to
fluctuations in the training data.
Steps in Bagging:
1. Generate Bootstrap Samples:
o Create B bootstrap samples from the original training dataset. Each sample is created by
randomly selecting instances with replacement.
2. Train Models:
o Train a base model (e.g., decision tree, linear model) on each of the B bootstrap samples.
The base models are typically the same type but trained independently on different data
subsets.
3. Aggregate Predictions:
o For regression: Compute the average prediction across all base models.
o For classification: Use majority voting to determine the final class label.
Example:
Suppose you want to apply bagging to improve the performance of a decision tree classifier:
1. Generate Bootstrap Samples:
o Create several bootstrap samples from your training data. Each sample is a random subset of
the original data, with some instances possibly repeated.
2. Train Decision Trees:
o Train a separate decision tree on each bootstrap sample. Each decision tree will learn different
patterns due to the variations in the training data.
3. Aggregate Predictions:
o For a new data point, get predictions from each decision tree.
o For classification, use majority voting: the class that the majority of trees predict is chosen as
the final class.
o For regression, average the predictions from all trees to obtain the final prediction.
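A minimal sketch of this bagging workflow using scikit-learn's BaggingClassifier (the synthetic dataset and parameter values are assumptions for illustration; in older scikit-learn versions the estimator argument is named base_estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)

# Bagging: train many trees on bootstrap samples and aggregate by majority vote.
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=0),  # base_estimator in older versions
    n_estimators=50,      # number of bootstrap samples / base models
    bootstrap=True,       # sample with replacement
    random_state=0,
)

print("Single tree CV accuracy :", cross_val_score(single_tree, X, y, cv=5).mean())
print("Bagged trees CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```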
Benefits:
• Improved Accuracy: By combining multiple models, bagging often results in better predictive
accuracy compared to a single model.
• Reduced Overfitting: Helps in reducing overfitting by averaging out the errors of individual
models.
• Robustness: More robust to noisy data and fluctuations in the training set.
Popular Algorithms Using Bagging:
• Random Forests: An extension of bagging where the base models are decision trees, and at
each split in the decision tree, only a random subset of features is considered. This adds an
additional layer of randomness and typically improves performance further.
• Bagged Decision Trees: A straightforward application of bagging where each base model is a
decision tree.
Limitations:
• Increased Complexity: Bagging increases the computational cost because multiple models
need to be trained and aggregated.
• Does Not Reduce Bias: While it reduces variance, it does not address the bias of the individual
models. If the base model has high bias, bagging will not significantly improve its performance.
Bagging is a powerful technique for improving the accuracy and stability of machine learning models,
particularly useful when the base models are prone to high variance, such as decision trees.

Boosting:
Boosting is an ensemble learning technique designed to improve the performance of a machine
learning model by combining multiple weak learners to create a strong learner. The main idea
behind boosting is to build models sequentially, with each new model focusing on the errors made
by the previous ones. Here's a closer look at boosting:
Key Concepts in Boosting
1. Weak Learner:
o A weak learner is a model that performs slightly better than random guessing. In practice,
simple models like decision stumps (one-level decision trees) are often used as weak learners
in boosting.
2. Sequential Learning:
o Models are built sequentially. Each new model is trained to correct the mistakes of the
previous models. This iterative process helps improve the overall performance of the
ensemble.
3. Weight Adjustment:
o During each iteration, the algorithm adjusts the weights of incorrectly classified instances so
that subsequent models focus more on the hard-to-classify examples.
4. Combining Models:
o The final model is a weighted sum of all the weak learners. The idea is that combining many
weak models will yield a powerful and accurate model.
Popular Boosting Algorithms
1. AdaBoost (Adaptive Boosting):
o Description: One of the earliest and most well-known boosting algorithms. AdaBoost adjusts
the weights of misclassified samples and combines multiple weak classifiers to form a strong
classifier.
o Use Cases: Binary and multiclass classification problems.
o Strengths: Simple to implement, effective with many types of weak learners.
o Limitations: Sensitive to noisy data and outliers.
2. Gradient Boosting Machines (GBM):
o Description: Builds models sequentially where each new model attempts to correct the
residual errors from the previous model. It uses gradient descent to minimize a loss function.
o Use Cases: Versatile and used in many applications, from finance to healthcare.
o Strengths: High accuracy, can handle different types of data.
o Limitations: Computationally expensive, sensitive to hyperparameter tuning.
3. XGBoost (Extreme Gradient Boosting):
o Description: An optimized version of gradient boosting that includes additional features like
regularization to prevent overfitting, and efficient handling of missing values.
o Use Cases: Kaggle competitions, real-world problems with large datasets.
o Strengths: High performance, scalability, robust to overfitting.
o Limitations: Requires careful tuning, may be complex for beginners.
Advantages of Boosting
• Improved Accuracy: Boosting generally improves model accuracy by focusing on difficult-to-
classify examples.
• Versatility: Can be applied to various types of data and problems.

• Robustness: Helps to reduce overfitting through techniques like regularization (in advanced
algorithms like XGBoost).
Disadvantages of Boosting
• Computational Complexity: Boosting can be computationally expensive and time-consuming,
especially with large datasets.
• Sensitivity to Noise: Boosting algorithms can be sensitive to noisy data and outliers.
• Parameter Tuning: Requires careful tuning of hyperparameters to achieve optimal
performance.
Boosting is a powerful technique that often yields state-of-the-art performance in machine learning
tasks, especially when dealing with complex datasets and challenging problems.
Gradient Boosting:
Gradient Boosting is an ensemble learning technique used to build predictive models by combining
multiple weak learners, typically decision trees, to create a strong learner. It builds models
sequentially, with each new model aiming to correct the errors made by the previous ones. Here’s a
detailed overview of Gradient Boosting:
How Gradient Boosting Works
1. Initialization:
o Start with an initial model, usually a simple model that predicts the mean or median of the
target variable (for regression) or the majority class (for classification).
2. Compute Residuals:
o Calculate the residuals, which are the differences between the actual target values and the
predictions made by the current model. These residuals represent the errors that need to be
corrected.
3. Fit a New Model:
o Train a new model (a weak learner) to predict these residuals. The new model aims to correct
the errors of the previous model.
4. Update the Model:
o Add the new model to the existing ensemble, usually with a learning rate to control the
contribution of each new model. The learning rate is a hyperparameter that scales the new
model’s predictions before adding them to the ensemble.
5. Iterate:
o Repeat the process of computing residuals, fitting new models, and updating the ensemble
until a specified number of models are added or until performance improvement stops.
6. Combine Models:
o The final prediction is obtained by aggregating the predictions of all models in the ensemble,
typically by summing their predictions.
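The loop below is a rough from-scratch sketch of this sequential residual-fitting process for a regression problem, using a shallow scikit-learn tree as the weak learner; the data, learning rate, and number of rounds are made-up illustrative values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 100

# 1. Initialization: start with a constant prediction (the mean of the target).
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_rounds):
    # 2. Compute residuals (for squared error, the negative gradient equals the residual).
    residuals = y - prediction
    # 3. Fit a shallow tree (weak learner) to the residuals.
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)
    # 4. Update the ensemble, scaled by the learning rate (shrinkage); 5. iterate.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

# 6. Final prediction for new data: base value plus the scaled sum of all trees.
def predict(X_new, base=y.mean()):
    return base + learning_rate * sum(t.predict(X_new) for t in trees)

print("Training MSE:", np.mean((y - predict(X)) ** 2))
```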
Gradient Boosting in ML

Gradient Boosting is a popular boosting algorithm in machine learning used for classification and
regression tasks. Boosting is an ensemble learning method that trains models sequentially, where
each new model tries to correct the previous one. It combines several weak learners into a strong
learner. The two most popular boosting algorithms are:
1. AdaBoost
2. Gradient Boosting
Gradient Boosting
Gradient Boosting is a powerful boosting algorithm that combines several weak learners into strong
learners, in which each new model is trained to minimize the loss function such as mean squared
error or cross-entropy of the previous model using gradient descent. In each iteration, the algorithm
computes the gradient of the loss function with respect to the predictions of the current ensemble
and then trains a new weak model to minimize this gradient. The predictions of the new model are
then added to the ensemble, and the process is repeated until a stopping criterion is met.
In contrast to AdaBoost, the weights of the training instances are not tweaked, instead, each
predictor is trained using the residual errors of the predecessor as labels. There is a technique called
the Gradient Boosted Trees whose base learner is CART (Classification and Regression Trees). The
below diagram explains how gradient-boosted trees are trained for regression problems.

Gradient Boosted Trees for Regression
The ensemble consists of M trees. Tree1 is trained using the feature matrix X and the labels y. The
predictions labeled y1(hat) are used to determine the training set residual errors r1. Tree2 is then
trained using the feature matrix X and the residual errors r1 of Tree1 as labels. The predicted
results r1(hat) are then used to determine the residual r2. The process is repeated until all the M
trees forming the ensemble are trained. There is an important parameter used in this technique
known as Shrinkage. Shrinkage refers to the fact that the prediction of each tree in the ensemble is
shrunk by multiplying it by the learning rate (eta), which ranges between 0 and 1. There is a trade-
off between eta and the number of estimators, decreasing learning rate needs to be compensated
with increasing estimators in order to reach certain model performance. Since all trees are trained
now, predictions can be made. Each tree predicts a label and the final prediction is given by the
formula,
y(pred) = y1 + (eta * r1) + (eta * r2) + ... + (eta * rN)
In gradient boosting decision trees, we combine many weak learners to come up with one strong
learner.
The weak learners here are the individual decision trees. All the trees are connected in series and
each tree tries to minimize the error of the previous tree. Due to this sequential connection,
boosting algorithms are usually slow to learn (controllable by the developer using the learning rate
parameter), but also highly accurate.
In statistical learning, models that learn slowly perform better. The weak learners are fit in such a
way that each new learner fits into the residuals of the previous step so as the model improves. The
final model adds up the result of each step and thus a stronger learner is eventually achieved.

A loss function is used to compute the residuals.


For instance, mean squared error (MSE) can be used for a regression task and logarithmic loss (log
loss) can be used for classification tasks. It is worth noting that existing trees in the model do not
change when a new tree is added.
The added decision tree fits the residuals from the current model.

From <https://www.machinelearningplus.com/machine-learning/gradient-boosting/>

Difference between AdaBoost and Gradient Boosting

The differences between AdaBoost and Gradient Boosting are as follows:

• Weight updates: During each iteration in AdaBoost, the weights of incorrectly classified samples
are increased, so that the next weak learner focuses more on these samples. Gradient Boosting
instead updates the model by computing the negative gradient of the loss function with respect
to the predicted output.
• Weak learners: AdaBoost uses simple decision trees with one split, known as decision stumps, as
weak learners. Gradient Boosting can use a wide range of base learners, such as decision trees
and linear models.
• Robustness: AdaBoost is more susceptible to noise and outliers in the data, as it assigns high
weights to misclassified samples. Gradient Boosting is generally more robust, as it updates based
on the gradients, which are less sensitive to outliers.


Key Concepts in Gradient Boosting


1. Residuals:
o The difference between the actual target values and the predictions made by the current
ensemble model. Residuals indicate the areas where the model is still making errors.
2. Learning Rate:
o A hyperparameter that controls how much each new model contributes to the ensemble.
Smaller learning rates make the training process more robust but require more iterations.
3. Weak Learner:
o Typically a small decision tree, also known as a decision stump. The weak learner is trained to
predict residuals rather than the target variable directly.
4. Loss Function:
o The function used to measure the performance of the model and guide the optimization
process. Common loss functions include mean squared error for regression and log loss for
classification.
Advantages of Gradient Boosting
• High Accuracy: Gradient Boosting often provides superior predictive performance and can
handle complex datasets.
• Flexibility: Can be used with various loss functions and weak learners, making it versatile for
different types of problems.
• Feature Importance: Can provide insights into feature importance, which is useful for
understanding the model.
Disadvantages of Gradient Boosting
• Computationally Intensive: Training can be time-consuming, especially for large datasets and
complex models.
• Sensitivity to Hyperparameters: Requires careful tuning of hyperparameters like learning rate,
number of trees, and tree depth.
• Overfitting: Can overfit the training data if not properly regularized or if the model is too
complex.
Popular Gradient Boosting Implementations
1. XGBoost (Extreme Gradient Boosting):
o Description: An optimized and efficient implementation of gradient boosting with
regularization to prevent overfitting and support for parallel processing.
o Use Cases: Competitions, large datasets, diverse applications.
o Strengths: High performance, scalable, robust to overfitting.
2. LightGBM (Light Gradient Boosting Machine):
o Description: A gradient boosting framework that uses histogram-based methods for faster
training and lower memory usage.
o Use Cases: Large-scale datasets, high-dimensional data.
o Strengths: Fast, efficient, handles large datasets effectively.
3. CatBoost (Categorical Boosting):
o Description: A gradient boosting library that is particularly good at handling categorical
features directly.
o Use Cases: Problems with many categorical features, such as in finance and retail.
o Strengths: Efficient handling of categorical variables, robust performance.
Applications
Gradient Boosting is widely used in various domains, including:
• Finance: Credit scoring, fraud detection.
• Healthcare: Disease prediction, patient risk assessment.
• Marketing: Customer segmentation, churn prediction.
• Competition: Kaggle competitions and other data science challenges.
Gradient Boosting is a powerful technique that, when tuned properly, can achieve state-of-the-art
results in many machine learning tasks.

• Concept: Gradient Boosting builds models sequentially, where each new model tries to correct
the errors of the combined ensemble of previous models by minimizing a loss function.
• Algorithm:
1. Initialize with a base model (often a simple model like the mean of the target variable).
2. Compute the residuals (errors) of the current model.
3. Fit a new model to these residuals.
4. Update the ensemble by adding the new model, usually scaled by a learning rate.
5. Repeat until a stopping criterion is met.
• Strengths: Highly flexible and can be tuned to various types of loss functions.
• Weaknesses: Can be slow to train and sensitive to hyperparameters like learning rate.
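For reference, a hedged usage sketch with scikit-learn's GradientBoostingRegressor, mapping the key concepts above (loss function, learning rate, number of trees, weak-learner depth) onto its parameters; the dataset and values are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(
    loss="squared_error",   # loss function whose negative gradient the trees fit
    learning_rate=0.05,     # shrinkage: smaller values need more estimators
    n_estimators=300,       # number of sequentially trained trees
    max_depth=3,            # depth of each weak learner
    random_state=0,
)
gbr.fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, gbr.predict(X_test)))
```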
Ada Boosting:
AdaBoost (Adaptive Boosting) is a popular ensemble learning technique designed to improve the
performance of weak classifiers by combining them into a strong classifier. It was introduced by Yoav
Freund and Robert Schapire in 1995. AdaBoost is particularly known for its simplicity and
effectiveness.
Adaboost uses decision stumps as weak learners.
Decision stumps are decision trees with only a single split. AdaBoost also attaches weights to
observations, adding more weight to 'difficult-to-classify' observations and less weight to those that
are easy to classify.
The aim is to put more emphasis on the difficult-to-classify instances for every new weak learner. For
each subsequent model, the misclassified observations receive more weight; as a result, these
observations are sampled more often in the new dataset according to their new weights, giving the
model a chance to learn such records and classify them correctly.

As a result, misclassifying the 'difficult-to-classify' observations is discouraged. The gradient boosting
algorithm is slightly different from AdaBoost: gradient boosting simply tries to explain (predict) the
error left over by the previous model. Since the loss function is optimized using gradient descent, the
method is called gradient boosting. Further, gradient boosting uses short, less complex decision trees
instead of decision stumps.

How AdaBoost Works


1. Initialization:
o Start with a dataset where each instance has an equal weight. The initial model is usually a
weak learner, such as a decision stump (a one-level decision tree).
2. Iterative Training:
o Step 1: Train the weak learner on the weighted dataset. Initially, all samples have equal
weights.
o Step 2: Calculate the errors of the weak learner. The error is the weighted sum of the
misclassified samples.
o Step 3: Compute the weight of the weak learner based on its error. Weak learners that
perform well are assigned higher weights, while those that perform poorly are assigned lower
weights.
o Step 4: Update the weights of the samples. Increase the weights of misclassified samples so
that the next weak learner focuses more on the difficult cases.
o Step 5: Normalize the weights so that they sum up to 1.
3. Combine Weak Learners:
o After a specified number of iterations, combine all the weak learners into a single strong
classifier. Each weak learner's prediction is weighted by its importance (computed in Step 3).
4. Final Prediction:
o The final prediction is made by aggregating the predictions of all weak learners, typically using
a weighted vote.
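A minimal sketch of AdaBoost with decision stumps using scikit-learn's AdaBoostClassifier (the synthetic data and parameter values are illustrative; older scikit-learn versions name the estimator argument base_estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.05, random_state=1)

# A decision stump (max_depth=1) is the classic weak learner for AdaBoost.
stump = DecisionTreeClassifier(max_depth=1)

ada = AdaBoostClassifier(
    estimator=stump,      # "base_estimator" in older scikit-learn versions
    n_estimators=200,     # number of sequentially trained weak learners
    learning_rate=0.5,    # shrinks the contribution (weight) of each learner
    random_state=1,
)
print("AdaBoost CV accuracy:", round(cross_val_score(ada, X, y, cv=5).mean(), 3))
```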
Key Concepts in AdaBoost
1. Weak Learner:
o A classifier that performs slightly better than random guessing. In practice, decision stumps
are commonly used as weak learners in AdaBoost.
2. Sample Weights:

o Weights assigned to each sample in the training dataset. These weights are adjusted in each
iteration to focus more on the misclassified samples.
3. Classifier Weights:
o Weights assigned to each weak learner based on its performance. A classifier with lower error
has a higher weight in the final model.
4. Error Calculation:
o The error of a weak learner is the weighted sum of the misclassified instances. The formula is
Error = (Σ_{i=1..n} w_i · misclassified_i) / (Σ_{i=1..n} w_i), where w_i is the weight of the i-th
instance, and misclassified_i is 1 if the instance is misclassified and 0 otherwise.
Advantages of AdaBoost
• Improved Accuracy: AdaBoost often results in higher accuracy by focusing on hard-to-classify
instances.
• Simplicity: It’s straightforward to implement and can work with a variety of weak learners.
• Flexibility: Can be used with different types of classifiers and loss functions.
Disadvantages of AdaBoost
• Sensitivity to Noisy Data: AdaBoost can be sensitive to noisy data and outliers since it focuses
more on the misclassified samples.
• Computational Complexity: Training can be computationally expensive, especially with a large
number of weak learners.
Applications
AdaBoost is used in various fields and applications, including:
• Image Classification: Face detection, object recognition.
• Text Classification: Spam detection, sentiment analysis.
• Medical Diagnosis: Disease classification, risk prediction.
Key Points
• Boosting Process: AdaBoost builds models sequentially and adjusts sample weights to improve
performance on misclassified instances.
• Combining Models: The final model is a weighted combination of all the weak learners, with
weights based on their performance.
AdaBoost remains a widely used technique due to its effectiveness and ease of use. It’s especially
useful in scenarios where the data is noisy or when a simple model does not provide satisfactory
performance.
• Algorithm:
1. Initialize weights for all instances in the training set.
2. Train a weak learner (e.g., a decision tree with a single split) and calculate its error.
3. Update the weights of misclassified instances to increase their importance.
4. Train the next weak learner on the updated weights.
5. Combine the predictions of all weak learners, weighted by their performance.
• Strengths: Simple to implement, effective for a wide range of problems.
• Weaknesses: Sensitive to noisy data and outliers.
XG Boosting:
XGBoost (Extreme Gradient Boosting) is a highly efficient and scalable implementation of gradient
boosting designed for large-scale machine learning tasks. Developed by Tianqi Chen, XGBoost has
become one of the most popular and successful machine learning algorithms, known for its high
performance in various predictive modeling competitions.
XGBoost is an implementation of Gradient Boosted decision trees. XGBoost models majorly
dominate in many Kaggle Competitions.
In this algorithm, decision trees are created in sequential form. Weights play an important role in
XGBoost. Weights are assigned to all the independent variables which are then fed into the decision
tree which predicts results. The weight of variables predicted wrong by the tree is increased and
these variables are then fed to the second decision tree. These individual classifiers/predictors then
ensemble to give a strong and more precise model. It can work on regression, classification, ranking,
and user-defined prediction problems.

Key Features of XGBoost


1. Gradient Boosting Framework:

o XGBoost builds on the gradient boosting framework, which sequentially improves model
performance by combining weak learners (typically decision trees) and correcting errors from
previous models.
2. Regularization:
o XGBoost includes regularization terms (L1 and L2) to control model complexity and prevent
overfitting. This feature is a significant improvement over traditional gradient boosting
methods.
3. Handling Missing Values:
o XGBoost can handle missing values natively. It learns how to handle them during training,
which simplifies preprocessing.
4. Parallel and Distributed Computing:
o XGBoost supports parallel processing, which speeds up training by utilizing multiple CPU cores.
It also supports distributed computing, making it scalable to large datasets.
5. Tree Pruning:
o XGBoost uses a more efficient tree pruning method known as "max depth" pruning, which
helps improve performance and reduce computation time.
6. Built-in Cross-Validation:
o It provides a built-in cross-validation feature that helps in evaluating the model’s performance
during training.
7. Flexible Loss Functions:
o XGBoost supports a variety of loss functions for different tasks, such as regression,
classification, and ranking.
How XGBoost Works
1. Initialization:
o Start with an initial prediction, which is typically a simple model such as the mean for
regression or a base probability for classification.
2. Compute Residuals:
o Calculate the residuals (errors) of the current model. These residuals represent the difference
between the actual target values and the predictions made by the model.
3. Fit New Models:
o Train a new weak learner (usually a decision tree) to predict these residuals. The new model
aims to correct the errors of the previous models.
4. Update Predictions:
o Update the model’s predictions by adding the new model’s predictions, weighted by a learning
rate. The learning rate controls how much each new model contributes to the overall
ensemble.
5. Iterate:
o Repeat the process of computing residuals, fitting new models, and updating predictions for a
specified number of iterations or until performance stops improving.
6. Combine Models:
o Combine the predictions of all weak learners to make the final prediction. Each model’s
contribution is weighted based on its performance.
Key Hyperparameters in XGBoost
1. Learning Rate (eta):
o Controls the step size at each iteration while moving toward the optimal solution. A smaller
learning rate requires more boosting rounds but can lead to better performance.
2. Number of Trees (n_estimators):
o The number of boosting rounds or trees in the ensemble.
3. Maximum Depth (max_depth):
o The maximum depth of each decision tree. Controlling this parameter helps in reducing
overfitting.
4. Min Child Weight (min_child_weight):
o Minimum sum of instance weight (hessian) needed in a child node. It helps in controlling
overfitting.
5. Subsample (subsample):
o The fraction of samples used for fitting each individual tree. It helps in preventing overfitting.
6. Column Sample by Tree/Level (colsample_bytree, colsample_bylevel):

o The fraction of features to be randomly sampled for each tree or level.
7. Regularization Parameters (alpha, lambda):
o alpha (L1 regularization) and lambda (L2 regularization) help in controlling model complexity
and preventing overfitting.
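A hedged sketch of how these hyperparameters map onto the xgboost Python package's scikit-learn-style API (assuming a reasonably recent xgboost version is installed; the dataset and values are illustrative):

```python
# Assumes the xgboost package is installed (pip install xgboost).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

model = XGBClassifier(
    n_estimators=300,        # number of boosting rounds
    learning_rate=0.05,      # eta
    max_depth=4,
    min_child_weight=1,
    subsample=0.8,           # row sampling per tree
    colsample_bytree=0.8,    # feature sampling per tree
    reg_alpha=0.0,           # L1 regularization (alpha)
    reg_lambda=1.0,          # L2 regularization (lambda)
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```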
Advantages of XGBoost
• Performance: XGBoost often provides state-of-the-art performance in various machine
learning tasks, including competitions.
• Scalability: Efficiently handles large datasets and supports parallel and distributed computing.
• Flexibility: Supports multiple loss functions and evaluation criteria.
• Regularization: Includes built-in regularization to avoid overfitting.
Disadvantages of XGBoost
• Complexity: Requires careful tuning of hyperparameters, which can be challenging for
beginners.
• Computational Resources: Can be computationally intensive, especially with large datasets
and many boosting rounds.
Applications
XGBoost is widely used in various domains:
• Competitions: Frequently used in Kaggle and other data science competitions.
• Finance: Credit scoring, fraud detection.
• Healthcare: Disease prediction, risk assessment.
• Marketing: Customer segmentation, churn prediction.
XGBoost's efficiency, flexibility, and high performance have made it a go-to choice for many machine
learning practitioners and data scientists.
• Concept: An optimized implementation of Gradient Boosting that incorporates regularization (to
prevent overfitting) and parallel processing (to speed up computation).
• Algorithm:
1. Start with a base model.
2. For each iteration, fit a new model to the residuals with additional regularization terms.
3. Use gradient descent to minimize a loss function.
4. Incorporate a regularization term to control model complexity.
5. Combine models to form the final prediction.
• Strengths: High performance, handles missing values, and robust against overfitting.
• Weaknesses: Requires careful tuning of hyperparameters.
KNN Algorithm:
The k-Nearest Neighbors (k-NN) algorithm is a simple and intuitive supervised learning method used
for both classification and regression tasks. It operates based on the principle that similar data
points tend to be close to each other in the feature space.

How k-NN Works


1. Choose the Number of Neighbors (k):
o Select the number of nearest neighbors (k) to consider when making predictions. This is a
hyperparameter that determines how many of the closest data points will influence the
classification or regression outcome.
2. Distance Metric:
o Calculate the distance between the query point (the data point you want to classify or predict)
and all other points in the training dataset. Common distance metrics include:
▪ Euclidean Distance: d(x, y) = sqrt( Σ_{i=1..n} (x_i - y_i)² )
▪ Manhattan Distance: d(x, y) = Σ_{i=1..n} |x_i - y_i|
▪ Minkowski Distance: a generalization of the Euclidean and Manhattan distances, with a
parameter p.
3. Find Nearest Neighbors:
o Identify the k nearest data points to the query point based on the chosen distance metric.

4. Make a Prediction:
o Classification: Assign the class label that is most common among the k nearest neighbors. This
is often done using a majority vote.
o Regression: Compute the average (or weighted average) of the target values of the k nearest
neighbors to make a prediction.
Key Considerations
1. Choosing k:
o A small value of k (e.g., 1) can make the model sensitive to noise and outliers, leading to
overfitting. A larger value of k smooths the decision boundary and reduces the impact of noise
but can lead to underfitting.
o A common practice is to use cross-validation to determine the optimal value of k.
2. Distance Metric:
o The choice of distance metric can affect the performance of the k-NN algorithm. Euclidean
distance is commonly used, but other metrics like Manhattan distance might be more suitable
for certain types of data.
3. Feature Scaling:
o k-NN is sensitive to the scale of the features because distance calculations are affected by the
magnitude of the features. Therefore, it's important to standardize or normalize features
before applying k-NN.
4. Computational Complexity:
o k-NN can be computationally expensive, especially for large datasets, as it requires calculating
distances between the query point and all points in the training set. Using data structures like
KD-trees or ball trees can help improve efficiency.
5. Handling Large Datasets:
o For very large datasets, approximate nearest neighbor algorithms (like Locality Sensitive
Hashing or Approximate Nearest Neighbors) may be used to speed up the search process.
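A minimal sketch that puts these considerations together with scikit-learn: features are scaled first, and k and the distance metric are chosen by cross-validation (dataset and candidate values are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features first: k-NN distances are sensitive to feature magnitude.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Choose k (and the distance metric) by cross-validation.
grid = GridSearchCV(
    pipe,
    param_grid={
        "kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 11],
        "kneighborsclassifier__metric": ["euclidean", "manhattan"],
    },
    cv=5,
)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test accuracy:", grid.score(X_test, y_test))
```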
Advantages of k-NN
• Simplicity: Easy to understand and implement.
• No Training Phase: Unlike other algorithms, k-NN does not require a separate training phase;
the training data is used directly for making predictions.
• Adaptability: Can be used for both classification and regression tasks.
Disadvantages of k-NN
• Computationally Intensive: Requires calculating distances to all training samples, which can
be slow for large datasets.
• Sensitive to Noise: Outliers and noisy data points can negatively impact performance.
• Curse of Dimensionality: Performance can degrade with high-dimensional data due to the
increased sparsity of the feature space.
Applications
k-NN is used in various domains:
• Image Recognition: Identifying objects or patterns in images.
• Recommender Systems: Suggesting products or content based on user preferences.
• Medical Diagnosis: Classifying diseases based on patient data.
• Anomaly Detection: Identifying unusual patterns in data.
In summary, k-NN is a versatile and straightforward algorithm that can perform well with
appropriately scaled features and a well-chosen value of k. It is best suited for smaller to medium-
sized datasets where computational efficiency is not a critical concern.
Market Basket Analysis:
Market Basket Analysis (MBA) is a technique used to understand the purchase behavior of
customers by analyzing the patterns of items frequently bought together. This analysis is often used
in retail to identify product associations, which can help in inventory management, promotional
strategies, and cross-selling opportunities. The most common algorithm used for Market Basket
Analysis is the Apriori Algorithm, but there are other algorithms such as FP-Growth and Eclat as well.
Key Concepts:
1. Itemsets: Combinations of items that appear together in transactions.
2. Support: The proportion of transactions that contain a particular itemset. It measures how
frequently an itemset appears in the dataset.
3. Confidence: The probability that an item B is purchased given that item A is purchased. It measures
the strength of the association between items A and B.

4. Lift: The ratio of the observed support to the expected support if A and B were independent. It
measures how much more often A and B are purchased together compared to what would be
expected by chance.
Algorithms for Market Basket Analysis:
1. Apriori Algorithm
Concept: Apriori is a classic algorithm that identifies frequent itemsets using a breadth-first search
strategy. It generates candidate itemsets, counts their occurrences, and prunes those that do not
meet the minimum support threshold.
Algorithm Steps:
1. Generate Frequent Itemsets:
o Start with single items and generate itemsets that meet the minimum support threshold.
o Combine these itemsets to form larger itemsets.
o Repeat until no more itemsets meet the support threshold.
2. Generate Association Rules:
o From the frequent itemsets, generate rules of the form A → B, where A and B are itemsets.
o Calculate the confidence of these rules.
o Filter rules based on the minimum confidence threshold.
Pros:
• Simple to understand and implement.
• Effective for smaller datasets.
Cons:
• May be inefficient for large datasets due to the combinatorial explosion of itemsets.
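To make the two steps concrete, here is a small self-contained Python sketch of the Apriori idea (the toy transactions and thresholds are made up for illustration; libraries such as mlxtend provide ready-made implementations):

```python
from itertools import combinations

# Toy transactions (hypothetical shopping baskets).
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs", "butter"},
    {"bread", "eggs", "butter"},
    {"milk", "bread", "eggs"},
]
min_support = 0.6
min_confidence = 0.7
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Step 1: generate frequent itemsets level by level, pruning by min_support.
items = sorted({item for t in transactions for item in t})
frequent = {}                                  # maps frozenset -> support
current = [frozenset([i]) for i in items]
while current:
    kept = [c for c in current if support(c) >= min_support]
    frequent.update({c: support(c) for c in kept})
    # Candidate itemsets one item larger, built from the surviving itemsets.
    current = list({a | b for a, b in combinations(kept, 2) if len(a | b) == len(a) + 1})

# Step 2: generate rules A -> B from each frequent itemset and filter by confidence.
for itemset, sup in frequent.items():
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            confidence = sup / support(antecedent)
            if confidence >= min_confidence:
                consequent = itemset - antecedent
                lift = confidence / support(consequent)
                print(f"{set(antecedent)} -> {set(consequent)}: "
                      f"support={sup:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```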
2. FP-Growth Algorithm
Concept: FP-Growth (Frequent Pattern Growth) is an improvement over Apriori. It uses a compact
data structure called the FP-tree (Frequent Pattern Tree) to store the dataset and then mines the
frequent itemsets directly from this tree.
Algorithm Steps:
1. Construct FP-tree:
o Scan the database to determine the frequency of items.
o Create a tree structure that represents frequent items and their co-occurrences.
2. Mine Frequent Itemsets:
o Extract frequent itemsets from the FP-tree by recursively mining the tree.
Pros:
• More efficient than Apriori, especially for large datasets.
• Reduces the number of candidate itemsets generated.
Cons:
• More complex to implement compared to Apriori.
• Memory-intensive due to the FP-tree.
3. Eclat Algorithm
Concept: Eclat (Equivalence Class Transformation) uses a depth-first search strategy and a vertical
data format to find frequent itemsets. It intersects transaction lists to find itemsets.
Algorithm Steps:
1. Transform Data:
o Convert the transaction database into a vertical format where each item is associated with a
list of transactions in which it appears.
2. Find Frequent Itemsets:
o Use a depth-first search to intersect transaction lists of items to find frequent itemsets.
Pros:
• Efficient in terms of time complexity compared to Apriori.
• Can handle large datasets effectively.
Cons:
• Requires more memory to store vertical data format.

• Can be complex to implement.
Example Workflow:
1. Data Preparation:
o Gather transaction data, typically represented as a list of items bought in each transaction.
o Convert the data into the required format for the chosen algorithm (e.g., transaction-item
matrix for Apriori).
2. Apply the Algorithm:
o Choose an algorithm (Apriori, FP-Growth, Eclat) based on the dataset size and complexity.
o Set parameters such as minimum support and confidence thresholds.
o Run the algorithm to find frequent itemsets and generate association rules.
3. Analyze Results:
o Review the frequent itemsets and association rules.
o Use metrics like support, confidence, and lift to evaluate the strength of the associations.
o Interpret the results to inform business decisions, such as product placements or promotions.
4. Implement Insights:
o Apply the findings to marketing strategies, inventory management, and store layouts.
o Monitor the effectiveness of the implemented strategies and adjust as necessary.
Market Basket Analysis is a powerful tool for discovering relationships between products and
optimizing business strategies based on customer purchasing behavior.

Introduction

Nowadays, machine learning is helping the retail industry in many different ways. From forecasting
sales performance to identifying buyers, there are many applications of AI and ML in the retail
industry. Market basket analysis is a data mining technique retailers use to increase sales by better
understanding customer purchasing patterns. Analyzing large data sets, such as purchase history,
reveals product groupings and products likely to be purchased together. In this article, we will
comprehensively cover the topic of Market Basket Analysis and its various components, and then
dive deep into the ways of implementing it in machine learning, including how to perform it in
Python on a real-world dataset.


Learning Objectives

• To understand what Market Basket Analysis is and how it is used.
• To learn about the various algorithms used in Market Basket Analysis.
• To learn to implement the algorithms in Python.


What Is Market Basket Analysis?

Market basket analysis is a strategic data mining technique used by retailers to enhance sales by
gaining a deeper understanding of customer purchasing patterns. This method involves examining
substantial datasets, such as historical purchase records, to unveil inherent product groupings and
identify items that customers tend to buy together.

By recognizing these patterns of co-occurrence, retailers can make informed decisions to optimize
inventory management, devise effective marketing strategies, employ cross-selling tactics, and even
refine store layout for improved customer engagement.

For example, if customers are buying milk, how likely are they to also buy bread (and which kind of
bread) on the same trip to the supermarket? This information may lead to an increase in sales by
helping retailers do selective marketing based on predictions, cross-selling, and planning their shelf
space for optimal product placement.

Now, think of the universe as the set of items available at the store; each item then has a Boolean
variable that represents its presence or absence. We can represent each basket with a Boolean
vector of values assigned to these variables, and then analyze these vectors to identify purchase
patterns that reflect items frequently associated or bought together, representing such patterns in
the form of association rules.


How Does Market Basket Analysis Work?

• Collect data on customer transactions, such as the items purchased in each transaction, the
time and date of the transaction, and any other relevant information.
• Clean and preprocess the data, removing any irrelevant information, handling missing values,
and converting the data into a suitable format for analysis.
• Use association rule mining algorithms such as Apriori or FP-Growth to identify frequent
itemsets: sets of items that often appear together in a transaction.
• Calculate the support and confidence for each frequent itemset, expressing the likelihood of
one item being purchased given the purchase of another item.
• Generate association rules based on the frequent itemsets and their corresponding support
and confidence values. Association rules indicate the likelihood of purchasing one item given
the purchase of another item.
• Interpret the results of the market basket analysis, identifying frequent purchases, assessing
the strength of the association between items, and uncovering other relevant insights into
customer behavior and preferences.
• Use the insights from the analysis to inform business decisions such as product
recommendations, store layout optimization, and targeted marketing campaigns.
Types of Market Basket Analysis

• Predictive Market Basket Analysis: Employs supervised learning to forecast future customer
behavior. By recognizing cross-selling opportunities through purchase patterns, it enables
applications like tailored product recommendations, personalized promotions, and effective
demand forecasting. Additionally, it proves valuable in fraud detection.
• Differential Market Basket Analysis: Compares purchase histories across diverse segments,
unveiling trends and pinpointing buying habits unique to specific customer groups. Its
applications extend to competitor analysis, identification of seasonal trends, customer
segmentation, and insights into regional market dynamics.


Applications of Market Basket Analysis
• Retail: Identify frequently purchased product combinations and create promotions or cross-selling strategies.
• E-commerce: Suggest complementary products to customers and improve the customer experience.
• Hospitality: Identify which menu items are often ordered together and create meal packages or menu recommendations.
• Healthcare: Understand which medications are often prescribed together and identify patterns in patient behavior or treatment outcomes.
• Banking/Finance: Identify which products or services are frequently used together by customers and create targeted marketing campaigns or bundle deals.
• Telecommunications: Understand which products or services are often purchased together and create bundled service packages that increase revenue and improve the customer experience.

What Is Association Rule for Market Basket Analysis?

Let I = {I1, I2, …, Im} be the set of all items. Let D, the data, be a set of database transactions where

each transaction T is a nonempty itemset such that T ⊆ I. Each transaction is associated with an

identifier called a TID (transaction ID). Let A be a set of items (an itemset); a transaction T is said to

contain A if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and

A ∩ B = ∅. In the rule A ⇒ B, A is called the antecedent and B the consequent.

The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of

transactions in D that contain A ∪ B (i.e., the union of set A and set B, or both A and B). This is taken

as the probability P(A ∪ B). The rule A ⇒ B has confidence c in the transaction set D, where c is the

percentage of transactions in D containing A that also contain B. This is taken to be the conditional

probability P(B|A). That is,

• support(A ⇒ B) = P(A ∪ B)

• confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A) = support count(A ∪ B) / support count(A)

Rules that meet both a minimum support threshold (called min sup) and a minimum confidence threshold (called min conf) are termed 'strong'.
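
To make these two measures concrete, the following sketch computes the support and confidence of a single rule, {milk} ⇒ {bread}, on a handful of made-up transactions; the data and the rule are illustrative assumptions only.

```python
# Toy transactions (illustrative only) for the rule {milk} => {bread}
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]

A, B = {"milk"}, {"bread"}
n = len(transactions)

count_A = sum(A <= t for t in transactions)          # transactions containing A
count_AB = sum((A | B) <= t for t in transactions)   # transactions containing both A and B

support = count_AB / n           # P(A ∪ B)
confidence = count_AB / count_A  # P(B|A) = support(A ∪ B) / support(A)

print(f"support = {support:.2f}")      # 3/5 = 0.60
print(f"confidence = {confidence:.2f}")  # 3/4 = 0.75
```

Here support counts how often milk and bread occur together across all transactions, while confidence restricts attention to the transactions that contain milk.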

Generally, Association Rule Mining can be viewed in a two-step process:

• Find all Frequent itemsets: By definition, each of these itemsets will occur at least as

frequently as a pre-established minimum support count, min sup.

• Generate Association Rules from the Frequent itemsets: By definition, these

rules must satisfy minimum support and minimum confidence.

Association Rule Mining

Association Rule Mining is primarily used when you want to identify an association between

different items in a set and then find frequent patterns in a transactional database or relational

database.


Algorithms Used in Market Basket Analysis

There are multiple data mining techniques and algorithms used in Market Basket Analysis. In

predicting the probability of items that customers buy together, one of the important objectives is

to achieve accuracy. Commonly used algorithms include:

• Apriori Algorithm

• AIS

• SETM Algorithm

• FP Growth
Apriori Algorithm

The Apriori Algorithm is widely used and well known for association rule mining, making it a popular

choice in Market Basket Analysis. It is considered more accurate than the AIS and SETM algorithms. It

helps to find frequent itemsets in transactions and identifies association rules between these items. The

limitation of the Apriori Algorithm is frequent itemset generation: it needs to scan the database

many times, which increases running time and reduces performance, making it a computationally costly

step on large datasets. It uses the concepts of confidence and support.
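
The sketch below is a simplified, illustrative version of this level-wise idea (it omits the classic subset-based candidate pruning); note that every level performs another full scan of the transactions, which is exactly the costly step mentioned above. The function name, the minimum support count, and the toy data are assumptions for illustration.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support_count):
    """Level-wise frequent itemset search in the spirit of Apriori (simplified)."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # level-1 candidates
    frequent = {}
    k = 1
    while current:
        # One full scan of the transactions per level to count candidate support
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level_frequent = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level_frequent)
        # Join frequent k-itemsets to build (k+1)-item candidates for the next level
        keys = list(level_frequent)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"milk", "eggs"}, {"bread", "butter"}]
print(apriori_frequent_itemsets(transactions, min_support_count=2))
```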


AIS Algorithm

The AIS algorithm makes multiple passes over the entire database or transactional data. During every

pass, it scans all transactions. In the first pass, it counts the support of individual items and

determines which of them are frequent in the database. After each transaction scan, the algorithm

extends the large itemsets found in the previous pass to generate candidate itemsets: it forms the

candidates common to the large itemsets of the previous pass and the items of the current

transaction. This algorithm, developed to generate all large itemsets in a transactional database, was

the first published algorithm of its kind.

It focused on enhancing databases with the performance needed to process decision-support

queries. This technique is limited to only one item in the consequent.

• Advantage: The AIS algorithm was used to find whether there was an association between

items or not.

• Disadvantage: The main disadvantage of the AIS algorithm is that it generates too many

candidate itemsets that turn out to be small (infrequent). In addition, its data structures must be maintained, which requires extra space.
SETM Algorithm

This algorithm is quite similar to the AIS algorithm. The SETM algorithm makes successive passes

over the database. In the first pass, it counts the support of single items and determines which of

them are frequent in the database. It then generates candidate itemsets by extending the large

itemsets of the previous pass. In addition, the SETM algorithm remembers the TIDs (transaction IDs)

of the generating transactions along with the candidate itemsets.

• Advantage: While generating candidate itemsets, the SETM algorithm arranges the candidate

itemsets together with their TIDs (transaction IDs) in sequential order.

• Disadvantage: Every candidate itemset is associated with TIDs; hence it requires more

space to store a huge number of TIDs.


FP Growth

FP Growth stands for Frequent Pattern Growth. The algorithm represents the data in the form of an

FP tree (Frequent Pattern tree) and is a method of mining frequent itemsets. It is an improvement

over the Apriori Algorithm: no candidate generation is needed to find frequent patterns. The

frequent pattern tree structure maintains the association between the itemsets.

A Frequent Pattern Tree is a tree structure built from the itemsets of the data. The main purpose of

the FP tree is to mine the most frequent patterns. Every node of the FP tree represents an item of an

itemset. The root node represents null, whereas the lower nodes represent the itemsets of the data.

While creating the tree, the algorithm maintains the association of these nodes with the lower

nodes, that is, the association between itemsets.

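As a brief illustration of mining frequent itemsets with FP-Growth, the sketch below assumes the third-party mlxtend library is installed (the notes do not prescribe a specific library); the transactions and the 0.5 minimum support are illustrative.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Illustrative transactions
transactions = [
    ["milk", "bread"],
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "eggs"],
]

# One-hot encode the baskets into a Boolean DataFrame
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# Mine frequent itemsets with FP-Growth (no explicit candidate generation)
frequent_itemsets = fpgrowth(df, min_support=0.5, use_colnames=True)
print(frequent_itemsets)
```

Unlike the Apriori sketch earlier, fpgrowth builds the FP tree internally and returns the frequent itemsets without explicit candidate generation.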

Advantages of Market Basket Analysis

There are many advantages to implementing Market Basket Analysis in marketing. Market Basket

Analysis (MBA) applies to customer data from point of sale (PoS) systems.

It helps retailers in the following ways:

• Increases customer engagement

• Boosts sales and increases ROI

• Improves customer experience

• Optimizes marketing strategies and campaigns

• Helps in demographic data analysis



• Identifies customer behavior and patterns
Market Basket Analysis From the Customers’ Perspective

Let us take an example of market basket analysis from Amazon, the world’s largest eCommerce

platform. From a customer’s perspective, Market Basket Analysis in Data Mining is like shopping at a

supermarket. Generally, it observes all items bought by customers together in a single purchase, and

then it shows the most closely related products that customers tend to buy together in one purchase.

Implementing Market Basket Analysis in Python

Let us now implement market basket analysis in Python.


Steps to Implement

Here are the steps involved in using the Apriori algorithm to implement MBA; a minimal code sketch follows the list:

• First, define the minimum support and confidence for the association rule.

• Find all the itemsets in the transactions with higher support (sup) than the minimum support.

• Find all the rules for these subsets with higher confidence than minimum confidence.

• Sort these association rules in decreasing order of confidence.

• Analyze the rules along with their confidence and support.
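
A minimal end-to-end sketch of these steps, again assuming the third-party mlxtend library is installed (the notes do not prescribe a specific library); the transactions, the 0.4 minimum support, and the 0.6 minimum confidence are illustrative choices rather than recommended values.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["milk", "bread", "eggs"],
]

# Encode baskets as Boolean vectors
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Steps 1-2: frequent itemsets above the minimum support
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)

# Step 3: rules above the minimum confidence
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

# Steps 4-5: sort and inspect the strongest rules
rules = rules.sort_values("confidence", ascending=False)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```

The resulting rules DataFrame can then be analyzed as described in the last step, for example by filtering on lift or by inspecting rules with specific antecedents.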
