
4.CLASSIFICATION

Decision Tree Induction

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.

The following decision tree is for the concept buy_computer, which indicates whether a customer at
a company is likely to buy a computer or not. Each internal node represents a test on an attribute.

Each leaf node represents a class.
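
As an illustration of such a structure in code (not reproduced from the figure), the small Python sketch below hard-codes a buy_computer-style tree; the attributes age, student, and credit_rating and their test values are assumptions made for the example.

def buys_computer(age, student, credit_rating):
    # Illustrative decision tree for the buy_computer concept; the split
    # attributes (age, student, credit_rating) are assumed, not taken from the figure.
    if age == "youth":                               # internal node: test on age
        return "yes" if student == "yes" else "no"   # branch on student, then leaf
    elif age == "middle_aged":
        return "yes"                                 # leaf node: class label
    else:  # age == "senior"
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer("youth", "yes", "fair"))         # -> yes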

Decision Tree Induction

• A decision tree is a supervised learning method used in data mining for classification and
regression tasks. It is a tree that helps us with decision-making.
• The decision tree builds classification or regression models as a tree structure. It separates
a data set into smaller and smaller subsets while, at the same time, the tree is steadily
developed. The final tree consists of decision nodes and leaf nodes.
• A decision node has at least two branches, while the leaf nodes show a classification or
decision; no further splits can be made on leaf nodes. The uppermost decision node in a tree,
which corresponds to the best predictor, is called the root node. Decision trees can deal with
both categorical and numerical data.

Key Factors

Entropy:

Entropy refers to a common way to measure impurity. In the decision tree, it measures the
randomness or impurity in data sets.

Information Gain:

Information Gain refers to the decline in entropy after the dataset is split on an attribute. It is
also called Entropy Reduction. Building a decision tree is all about discovering the attributes that
return the highest information gain.
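
As a hedged illustration of both quantities, the short Python sketch below computes entropy and the information gain of a split; the label counts in the example are invented purely for demonstration.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels (impurity measure).
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    # Entropy of the parent minus the weighted entropy of the children after a split.
    total = len(parent_labels)
    weighted = sum(len(child) / total * entropy(child) for child in child_label_lists)
    return entropy(parent_labels) - weighted

# Invented example: 14 records split by some attribute into two subsets.
parent = ["yes"] * 9 + ["no"] * 5
children = [["yes"] * 6 + ["no"] * 1, ["yes"] * 3 + ["no"] * 4]
print(round(entropy(parent), 3))                     # ~0.94
print(round(information_gain(parent, children), 3))  # gain obtained by this split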


In short, a decision tree is just like a flow chart diagram, with the terminal nodes showing decisions.
Starting with the whole dataset, we can measure the entropy to find a way to segment the set until the
data in each subset belongs to the same class.

Why are decision trees useful?

• It enables us to analyze the possible consequences of a decision thoroughly.
• It provides us with a framework to measure the values of outcomes and the probabilities of
achieving them.
• It helps us to make the best decisions based on existing data and best estimates.
• In other words, we can say that a decision tree is a hierarchical tree structure that can be
used to split an extensive collection of records into smaller sets of classes by applying a
sequence of simple decision rules.
• A decision tree model comprises a set of rules for partitioning a large heterogeneous
population into smaller, more homogeneous, mutually exclusive classes. The attributes used for
splitting can be nominal, ordinal, binary, or quantitative variables; in contrast, the classes
must be of a qualitative type, such as categorical, ordinal, or binary.
• In brief, given data described by attributes together with a class, a decision tree creates a set
of rules that can be used to identify the class. One rule is applied after another, resulting
in a hierarchy of segments within a segment.
• The hierarchy is known as the tree, and each segment is called a node. With each
progressive division, the members of the resulting subsets become more and more similar
to each other. Hence, the algorithm used to build a decision tree is referred to as recursive
partitioning. A well-known such algorithm is CART (Classification and Regression Trees).

Consider the example of a factory where:

• Expanding the factory costs $3 million; the probability of a good economy is 0.6 (60%), which
leads to $8 million profit, and the probability of a bad economy is 0.4 (40%), which leads
to $6 million profit.
• Not expanding the factory costs $0; the probability of a good economy is 0.6 (60%), which
leads to $4 million profit, and the probability of a bad economy is 0.4 (40%), which leads to $2
million profit.
• The management team needs to take a data-driven decision on whether or not to expand based on
the given data.

NetExpand = (0.6 × 8 + 0.4 × 6) − 3 = $4.2M
NetNotExpand = (0.6 × 4 + 0.4 × 2) − 0 = $3.2M
$4.2M > $3.2M, therefore the factory should be expanded.
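
The same expected-value arithmetic can be verified with a few lines of Python; the costs, probabilities, and profits are the ones stated above (in $ millions).

def expected_net(cost, p_good, profit_good, profit_bad):
    # Expected profit of an option minus its upfront cost (all values in $M).
    return p_good * profit_good + (1 - p_good) * profit_bad - cost

net_expand = expected_net(cost=3, p_good=0.6, profit_good=8, profit_bad=6)   # 4.2
net_keep   = expected_net(cost=0, p_good=0.6, profit_good=4, profit_bad=2)   # 3.2
print("Expand" if net_expand > net_keep else "Do not expand")                # -> Expand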

Decision Tree Terminologies


• Root Node: A decision tree’s root node, which represents the original choice or
feature from which the tree branches, is the highest node.


• Internal Nodes (Decision Nodes): Nodes in the tree whose choices are determined
by the values of particular attributes. There are branches on these nodes that go to
other nodes.
• Leaf Nodes (Terminal Nodes): The endpoints of branches, where final decisions or predictions
are made. There are no further branches on leaf nodes.
• Branches (Edges): Links between nodes that show how decisions are made in
response to particular circumstances.
• Splitting: The process of dividing a node into two or more sub-nodes based on a
decision criterion. It involves selecting a feature and a threshold to create subsets of
data.
• Parent Node: A node that is split into child nodes. The original node from which a
split originates.
• Child Node: Nodes created as a result of a split from a parent node.
• Decision Criterion: The rule or condition used to determine how the data should be
split at a decision node. It involves comparing feature values against a threshold.
• Pruning: The process of removing branches or nodes from a decision tree to improve
its generalisation and prevent overfitting.

Example of a Decision Tree Algorithm


Forecasting Activities Using Weather Information
• Root node: Whole dataset
• Attribute: “Outlook” (Sunny, Overcast, Rainy).
• Subsets: Sunny, Overcast, and Rainy.


• Recursive Splitting: Divide the sunny subset even more according to humidity, for
example.
• Leaf Nodes: Activities include “swimming,” “hiking,” and “staying inside.”

Advantages of Decision Tree


• Easy to understand and interpret, making them accessible to non-experts.
• Handle both numerical and categorical data without requiring extensive
preprocessing.
• Provides insights into feature importance for decision-making.
• Handle missing values and outliers without significant impact.
• Applicable to both classification and regression tasks.
Disadvantages of Decision Tree
• Disadvantages include the potential for overfitting
• Sensitivity to small changes in data, limited generalization if training data is not
representative

• Potential bias in the presence of imbalanced data.

How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given record, the algorithm starts from the root
node of the tree. The algorithm compares the value of the root attribute with the corresponding
attribute of the record (from the real dataset) and, based on the comparison, follows the branch and
jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete process
can be better understood using the algorithm below:

o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.

o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values of the best attribute.


o Step-4: Generate the decision tree node, which contains the best attribute.

o Step-5: Recursively make new decision trees using the subsets of the dataset created

o Step-6: Continue this process until a stage is reached where the nodes cannot be classified
further; such final nodes are called leaf nodes.
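
A minimal Python sketch of this recursive procedure is given below, using information gain as the Attribute Selection Measure. It is an illustrative ID3-style implementation under assumed data formats (a list of attribute dictionaries plus a label list), not a production algorithm.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    # Step-6: stop when the subset is pure or no attributes remain -> leaf node.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):
        # Step-2: Attribute Selection Measure (information gain of splitting on attr).
        subsets = {}
        for row, lab in zip(rows, labels):
            subsets.setdefault(row[attr], []).append(lab)
        remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
        return entropy(labels) - remainder

    best = max(attributes, key=gain)                 # Step-4: node holds the best attribute.
    node = {best: {}}
    for value in {row[best] for row in rows}:        # Step-3: divide S into subsets.
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = [r for r, _ in sub], [l for _, l in sub]
        # Step-5: recursively build a sub-tree on each subset.
        node[best][value] = build_tree(sub_rows, sub_labels,
                                       [a for a in attributes if a != best])
    return node

# Invented toy data in the spirit of the weather example above.
rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"}]
labels = ["play", "stay in", "play"]
print(build_tree(rows, labels, ["outlook", "windy"]))   # -> {'windy': {'no': 'play', 'yes': 'stay in'}}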

Advantages of the Decision Tree

o It is simple to understand, as it follows the same process that a human follows while
making any decision in real life.

o It can be very useful for solving decision-related problems.

o It helps to think about all the possible outcomes for a problem.

o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.

o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.

o For more class labels, the computational complexity of the decision tree may increase.

Bayes Classification Methods

In numerous applications, the relationship between the attribute set and the class variable is
non-deterministic. In other words, the class label of a test record cannot be predicted with
certainty even though its attribute set is the same as that of some of the training examples. These
circumstances may arise due to noisy data or the presence of certain confounding factors that
influence classification but are not included in the analysis.

For example, consider the task of predicting whether an individual is at risk of liver illness based
on the individual's eating habits and working efficiency. Although most people who eat healthily and
exercise consistently have a lower probability of developing liver disease, they may still develop it
due to other factors, for example, consumption of high-calorie street food or alcohol abuse.
Determining whether an individual's eating routine is healthy or the workout efficiency is sufficient
is also subject to interpretation, which in turn may introduce uncertainties into the learning
problem.

Bayesian classification uses Bayes' theorem to predict the probability of occurrence of an event.
Bayesian classifiers are statistical classifiers based on Bayesian probability. The theorem
expresses how a level of belief, expressed as a probability, should change to account for evidence.

Bayes' theorem is named after Thomas Bayes, who first used conditional probability to provide an
algorithm that uses evidence to calculate limits on an unknown parameter.

Bayes' theorem is expressed mathematically by the following equation:

P(X/Y) = [P(Y/X) · P(X)] / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X/Y) is the conditional probability of event X occurring given that Y is true.

P(Y/X) is the conditional probability of event Y occurring given that X is true.

P(X) and P(Y) are the probabilities of observing X and Y independently of each other. This is
known as the marginal probability.

Bayesian interpretation:
In the Bayesian interpretation, probability measures a "degree of belief." Bayes' theorem connects
the degree of belief in a hypothesis before and after accounting for evidence. For example, let us
consider the example of a coin. If we toss a coin, we get either heads or tails, and each outcome
occurs 50% of the time. If the coin is flipped a number of times and the outcomes are observed, the
degree of belief may rise, fall, or remain the same depending on the outcomes.

For proposition X and evidence Y,

o P(X), the prior, is the initial degree of belief in X.

o P(X/Y), the posterior, is the degree of belief after accounting for Y.

o The quotient P(Y/X) / P(Y) represents the support Y provides for X.

Bayes' theorem can be derived from the definition of conditional probability: since
P(X and Y) = P(X/Y) · P(Y) and also P(X and Y) = P(Y/X) · P(X), equating the two and dividing
by P(Y) gives P(X/Y) = [P(Y/X) · P(X)] / P(Y).
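
As a quick numerical illustration of the theorem in a classification setting, the Python snippet below uses invented figures: assume 1% of patients have a disease (the prior), a test detects it 90% of the time, and it falsely flags 5% of healthy patients.

# Hypothetical numbers, purely to illustrate Bayes' theorem.
p_disease = 0.01              # P(X): prior probability of the class "disease"
p_pos_given_disease = 0.90    # P(Y/X): probability of a positive test given disease
p_pos_given_healthy = 0.05    # probability of a positive test given no disease

# Marginal probability of a positive test, P(Y), by the law of total probability.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(X/Y) = P(Y/X) * P(X) / P(Y)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # ~0.154: the belief rises from 1% to about 15%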

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.

o It can be used for Binary as well as Multi-class Classifications.

o It performs well in multi-class predictions as compared to the other algorithms.

o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.

o It is used in medical data classification.

o It can be used in real-time predictions because the Naïve Bayes Classifier is an eager learner.
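
A minimal sketch of a naïve Bayes classifier for categorical data is given below. It hand-rolls the counting rather than using a library, and the toy training set, attribute names, and Laplace smoothing constant are all assumptions made for the example.

from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    # Estimate class priors and per-class value counts from categorical data.
    priors = Counter(labels)
    cond = defaultdict(Counter)     # (class, attribute) -> counts of attribute values
    domains = defaultdict(set)      # attribute -> set of observed values
    for row, lab in zip(rows, labels):
        for attr, value in row.items():
            cond[(lab, attr)][value] += 1
            domains[attr].add(value)
    return priors, cond, domains

def predict(row, priors, cond, domains, alpha=1.0):
    # Choose the class maximizing P(class) * prod P(value | class), with Laplace smoothing,
    # under the naive assumption that attributes are independent given the class.
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for lab, count in priors.items():
        score = count / total
        for attr, value in row.items():
            score *= (cond[(lab, attr)][value] + alpha) / (count + alpha * len(domains[attr]))
        if score > best_score:
            best, best_score = lab, score
    return best

# Invented toy data: predict whether to play from the outlook.
rows = [{"outlook": "sunny"}, {"outlook": "rainy"}, {"outlook": "sunny"}, {"outlook": "overcast"}]
labels = ["no", "no", "yes", "yes"]
print(predict({"outlook": "overcast"}, *train_naive_bayes(rows, labels)))   # -> "yes"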

Rule-based classification

o Rule-based classifiers are just another type of classifier which makes the class decision
by using a set of "if...then" rules. These rules are easily interpretable, and thus these
classifiers are generally used to generate descriptive models. The condition used with
"if" is called the antecedent, and the predicted class of each rule is called the
consequent.

Properties of rule-based classifiers:


• Coverage: The percentage of records that satisfy the antecedent conditions of a
particular rule. The rules generated by a rule-based classifier may not be exhaustive,
i.e. there may be some records that are not covered by any of the rules.
• The decision boundary created by an individual rule is linear, but the overall model can be
much more complex than that of a decision tree, because many rules may be triggered for the
same record.

Using IF-THEN Rules for Classification


IF-THEN Rule
To define the IF-THEN rule, we can split it into two parts:


Rule Antecedent: This is the "if condition" part of the rule. It is present on the
LHS (left-hand side). The antecedent can have one or more attribute conditions, combined with the
logical AND operator.

Rule Consequent: This is present in the rule's RHS(Right Hand Side). The rule consequent
consists of the class prediction.

Example:
R1: IF tutor = coding Ninja AND student = yes
THEN happy Learning = true
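
A hedged Python sketch of how such IF-THEN rules can be applied is shown below; the rule list, attribute names, and default class are invented to mirror the example above, and rules are fired in order (first match wins).

# Each rule is (antecedent: attribute -> required value, consequent class). Illustrative only.
rules = [
    ({"tutor": "coding Ninja", "student": "yes"}, "happy Learning = true"),
    ({"tutor": "other"}, "happy Learning = false"),
]

def classify(record, rules, default="unknown"):
    # Fire the first rule whose antecedent fully matches the record.
    for antecedent, consequent in rules:
        if all(record.get(attr) == value for attr, value in antecedent.items()):
            return consequent
    return default   # no rule covers this record (the rule set is not exhaustive)

print(classify({"tutor": "coding Ninja", "student": "yes"}, rules))   # -> "happy Learning = true"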
Properties of Rule-Based Classifiers
There are two significant properties of rule-based classification in data mining. They are:
• Rules may not be mutually exclusive
• Rules may not be exhaustive

Rule Pruning
The assessment of rule quality is made on the original set of training data. A rule may perform well on
training data but less well on subsequent data. That is why rule pruning is required.

FOIL is one simple and effective method for rule pruning. For a given rule R,

FOIL_Prune(R) = (pos − neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by R, respectively.

Note − This value will increase with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
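
The measure can be stated in one line of code; the tuple counts below are hypothetical, used only to show how a rule would be compared with its pruned version.

def foil_prune(pos, neg):
    # FOIL_Prune = (pos - neg) / (pos + neg) for a rule covering pos positive and neg negative tuples.
    return (pos - neg) / (pos + neg)

original_rule = foil_prune(pos=40, neg=10)   # 0.60
pruned_rule   = foil_prune(pos=45, neg=12)   # ~0.58, lower -> keep the original rule in this case
print(original_rule, pruned_rule)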

Advantages of Rule-Based Classification

• The rule-based classification is easy to generate.

• It is highly expressive in nature and very easy to understand.


• It classifies new records very quickly.

• It helps us to handle redundant values during classification properly.

k-Nearest-Neighbor Classifiers
The k-nearest-neighbor method was first described in the early 1950s. The method is labor-intensive
when given large training sets, and it did not gain popularity until the 1960s, when increased
computing power became available. It has since been widely used in the area of pattern
recognition.

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test
tuple with training tuples that are similar to it. The training tuples are described by n attributes.
Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are
stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor
classifier searches the pattern space for the k training tuples that are closest to the unknown tuple.
These k training tuples are the k "nearest neighbors" of the unknown tuple.

"Closeness" is defined in terms of a distance metric, such as Euclidean distance. The
Euclidean distance between two points or tuples, say, X1 = (x11, x12, ..., x1n) and X2 = (x21,
x22, ..., x2n), is

dist(X1, X2) = sqrt((x11 − x21)^2 + (x12 − x22)^2 + ... + (x1n − x2n)^2)
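
A minimal k-nearest-neighbor classifier based on this Euclidean distance is sketched below in Python; the two-dimensional training tuples and the choice of k are invented for illustration.

import math
from collections import Counter

def euclidean(x1, x2):
    # Euclidean distance between two n-dimensional tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(unknown, training, k=3):
    # Find the k training tuples closest to the unknown tuple and take a majority vote.
    neighbors = sorted(training, key=lambda item: euclidean(unknown, item[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Invented 2-D training data: (point, class label).
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_classify((1.1, 1.0), training, k=3))   # -> "A"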

Classification and Prediction Issues

The major issue is preparing the data for Classification and Prediction. Preparing the data involves
the following activities, such as:


1. Data Cleaning: Data cleaning involves removing the noise and treatment of missing
values. The noise is removed by applying smoothing techniques, and the problem of
missing values is solved by replacing a missing value with the most commonly occurring
value for that attribute.

2. Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis
is used to know whether any two given attributes are related.

3. Data Transformation and Reduction: The data can be transformed by any of the following
methods.

o Normalization: The data is transformed using normalization. Normalization involves
scaling all values of a given attribute so that they fall within a small specified range.
Normalization is used when neural networks or methods involving distance measurements
are used in the learning step.

o Generalization: The data can also be transformed by generalizing it to higher-level
concepts. For this purpose, we can use concept hierarchies.
NOTE: Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.
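
As a hedged illustration of normalization, the Python sketch below applies min-max scaling to one attribute; the target range [0, 1] and the sample values are assumptions made for the example.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Scale all values of an attribute so that they fall within [new_min, new_max].
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

# Invented attribute values, e.g. incomes in thousands.
print(min_max_normalize([12, 30, 54, 98]))   # -> [0.0, 0.209..., 0.488..., 1.0]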

Comparison of Classification and Prediction Methods

Here are the criteria for comparing the methods of Classification and Prediction, such as:


o Accuracy: The accuracy of the classifier can be referred to as the ability of the classifier
to predict the class label correctly, and the accuracy of the predictor can be referred to as how
well a given predictor can estimate the unknown value.

o Speed: The speed of the method depends on the computational cost of generating and using
the classifier or predictor.

o Robustness: Robustness is the ability of the classifier or predictor to make correct
predictions even when the incoming data is noisy or contains missing values.

o Scalability: Scalability refers to the ability to construct the classifier or predictor
efficiently even when large amounts of data are given.

o Interpretability: Interpretability is how readily we can understand the reasoning behind

predictions or classification made by the predictor or classifier.

Precision and Recall


Precision is the proportion of correct positive classifications (true positive) divided by the total
number of predicted positive classifications that were made (true positive + false positive).
Recall is the proportion of correct positive classifications (true positive) divided by the total
number of the truly positive classifications (true positive + false negative).
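The two ratios translate directly into code; the confusion-matrix counts in the Python example below are invented.

def precision(tp, fp):
    # Correct positive classifications out of all predicted positives: TP / (TP + FP).
    return tp / (tp + fp)

def recall(tp, fn):
    # Correct positive classifications out of all truly positive records: TP / (TP + FN).
    return tp / (tp + fn)

# Hypothetical counts from a classifier's confusion matrix.
print(precision(tp=80, fp=20))   # 0.8
print(recall(tp=80, fn=40))      # ~0.667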
A PR curve is simply a graph with Precision values on the y-axis and Recall values on the
x-axis. In other words, the PR curve contains TP / (TP + FP) on the y-axis and
TP / (TP + FN) on the x-axis.
• It is important to note that Precision is also called the Positive Predictive Value (PPV).


• The recall is also called Sensitivity, Hit Rate, or True Positive Rate (TPR). The figure
below shows a comparison of sample PR and ROC curves.

Interpreting a Precision-Recall Curve


It is desired that the algorithm should have both high precision and high recall. However, most
machine learning algorithms often involve a trade-off between the two. A good PR curve has
greater AUC (area under the curve). In the figure above, the classifier corresponding to the blue
line has better performance than the classifier corresponding to the green line. It is important to
note that the classifier that has a higher AUC on the ROC curve will always have a higher AUC
on the PR curve as well.

Why do we need a PR curve when the ROC curve exists?


PR curve is particularly useful in reporting Information Retrieval results.
Information Retrieval involves searching a pool of documents to find ones that are relevant to a
particular user query. For instance, assume that the user enters a search query “Pink Elephants”.
The search engine skims through millions of documents (using some optimized algorithms) to
retrieve a handful of relevant documents. Hence, we can safely assume that the no. of relevant
documents will be much less compared to the no. of non-relevant documents.
In this scenario,
• TP = No. of retrieved documents that are relevant (good results).
• FP = No. of retrieved documents that are non-relevant (bogus search results).
• TN = No. of non-retrieved documents that are non-relevant.
• FN = No. of non-retrieved documents that are relevant (good documents we missed).


The ROC curve is a plot containing FPR = FP / (FP + TN) on the x-axis and Recall = TPR =
TP / (TP + FN) on the y-axis. Since the number of true negatives, i.e. non-retrieved
documents that are non-relevant, is such a huge number, the FPR becomes insignificantly
small.
Further, FPR does not help us evaluate a retrieval system well because we want to focus more on
the retrieved documents, and not the non-retrieved ones. The PR curve helps solve this issue. The
PR curve has the Recall value (TPR) on the x-axis and Precision = TP / (TP + FP) on the
y-axis. Precision helps highlight how relevant the retrieved results are, which is more
important while judging an IR system. Hence, a PR curve is more common for problems
involving information retrieval.
