Unit-IV
Neural Network-Based Algorithms
Solving a classification problem using NNs involves several steps:
1. Determine the number of output nodes as well as what attributes should be used as input.
The number of hidden layers also must be decided. This step is performed by a domain
expert.
2. Determine weights (labels) and functions to be used for the graph.
3. For each tuple in the training set, propagate it through the network and compare the
output prediction with the actual result.
4. For each tuple ti ∈ D, propagate ti through the network and make the appropriate
classification.
Issues
• Attributes (number of source nodes): This is the same issue as determining which attributes
to use as splitting attributes.
• Number of hidden layers: In the simplest case, there is only one hidden layer.
• Number of hidden nodes: Choosing the best number of hidden nodes per hidden layer is one
of the most difficult problems when using NNs. There have been many empirical and theoretical
studies attempting to answer this question. The answer depends on the structure of the NN, types
of activation functions, training algorithm, and problem being solved. If too few hidden nodes
are used, the target function may not be learned (underfitting). If too many nodes are used,
overfitting may occur.
• Training data: As with DTs, with too much training data the NN may suffer from overfitting,
while too little and it may not be able to classify accurately enough.
• Number of sinks: Although it is usually assumed that the number of output nodes is the same
as the number of classes, this is not always the case.
• Interconnections: In the simplest case, each node is connected to all nodes in the next level.
• Weights: The weight assigned to an arc indicates the relative weight between those two nodes.
Initial weights are usually assumed to be small positive numbers and are assigned randomly.
• Activation functions: Many different types of activation functions can be used.
• Learning technique: The technique for adjusting the weights is called the learning technique.
Although many approaches can be used, the most common approach is some form of
backpropagation, which is discussed in a subsequent subsection.
• Stop: The learning may stop when all the training tuples have propagated through the network
or may be based on time or error rate.
Advantages to the use of NNs for classification:
• NNs are more robust than DTs because of the weights.
• The NN improves its performance by learning. This may continue even after the training set
has been applied.
• The use of NNs can be parallelized for better performance.
• There is a low error rate and thus a high degree of accuracy once the appropriate training has
been performed.
• NNs are more robust than DTs in noisy environments.
NNs disadvantages:
• NNs are difficult to understand. Nontechnical users may have difficulty understanding how
NNs work. While it is easy to explain decision trees, NNs are much harder to interpret.
• Generating rules from NNs is not straightforward.
• Input attribute values must be numeric.
• Testing the trained network is difficult.
• Verification of its behavior is likewise difficult.
• As with DTs, overfitting may result.
• The learning phase may fail to converge.
• NNs may be quite expensive to use.
Propagation
The normal approach used for processing is called propagation.
Given a tuple of values input to the NN, X = (x1, ..., xh), one value is input at each node in the
input layer. Then the summation and activation functions are applied at each node, with an
output value created for each output arc from that node. These values are in turn sent to the
subsequent nodes. This process continues until a tuple of output values, Y = (y1, ..., ym), is
produced from the nodes in the output layer.
The overall process of propagation is sketched below.
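The source's propagation algorithm is not reproduced here; the following is a minimal Python sketch of the same idea. The sigmoid activation, layer sizes, and weight values are illustrative assumptions, not values from the text.

```python
import math

def sigmoid(s):
    """Sigmoidal activation function applied to a node's weighted sum."""
    return 1.0 / (1.0 + math.exp(-s))

def propagate(x, layers):
    """Propagate input tuple x through the network.
    layers is a list of weight matrices; layers[k][j][i] is the weight on the
    arc from node i in layer k to node j in layer k+1."""
    values = list(x)
    for weights in layers:
        # summation function: weighted sum of incoming values at each node
        sums = [sum(w * v for w, v in zip(row, values)) for row in weights]
        # activation function applied to each sum
        values = [sigmoid(s) for s in sums]
    return values

# Illustrative network: 2 input nodes, 2 hidden nodes, 1 output node
hidden_weights = [[0.5, -0.2], [0.3, 0.8]]
output_weights = [[1.0, -1.0]]
print(propagate((1.0, 0.5), [hidden_weights, output_weights]))
```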
Example
Figure shows a very simple NN used to classify university students as short, medium, or tall.
Activation function f3 is associated with the short class, f4 is associated with the medium class,
and f5 is associated with the tall class. In this case, the weight of each arc from the height node
is 1, while the weight on each gender arc is 0. This implies that in this case the gender values are
ignored.
NN Supervised Learning
The NN starting state is modified based on feedback of its performance with the data in the
training set. This type of learning is referred to as supervised because it is known a priori what
the desired output should be. Unsupervised learning can also be performed if the output is not
known.
Supervised learning in an NN is the process of adjusting the arc weights based on its
performance with a tuple from the training set. The behavior of the training data is known a
priori and thus can be used to fine-tune the network for better behavior in future similar
situations.
Suppose the output from node i is yi but should be di. The error produced by a node in any layer
can be found by
|yi - di|
The mean squared error (MSE) at that node is found by
(yi - di)^2 / 2
The total MSE error over all m output nodes in the NN is
Σ i=1..m (yi - di)^2 / m
This formula could be expanded over all tuples in the training set to see the total error over all of
them.
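As a small illustration of these error measures, the sketch below computes the per-node error, the per-node MSE, and the total MSE over the output nodes; the output and desired values are made-up numbers.

```python
def node_error(y, d):
    """Absolute error |y - d| produced at a single node."""
    return abs(y - d)

def node_mse(y, d):
    """Mean squared error (y - d)^2 / 2 for a single output node."""
    return (y - d) ** 2 / 2

def total_mse(outputs, desired):
    """Total MSE over all m output nodes."""
    m = len(outputs)
    return sum((y - d) ** 2 for y, d in zip(outputs, desired)) / m

outputs = [0.8, 0.1, 0.3]   # actual outputs y_i (illustrative)
desired = [1.0, 0.0, 0.0]   # desired outputs d_i (illustrative)
print(node_error(0.8, 1.0), node_mse(0.8, 1.0), total_mse(outputs, desired))
```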
The Hebb and delta rules are approaches to change the weight on an input arc to a node based on
the knowledge that the output value from that node is incorrect. With both techniques, a learning
rule is used to modify the input weights.
The change in a weight using the Hebb rule is represented by the following rule:
∆w = c · x · y
where x is the input value on that arc and y is the output of the node.
Here c is a constant often called the learning rate.
A rule of thumb is that c = 1 / |# entries in training set|
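A minimal sketch of the Hebb and delta weight-update rules under these definitions; the weight, input, output, and desired values below are illustrative assumptions.

```python
def hebb_update(w, c, x, y):
    """Hebb rule: change the weight in proportion to input times output."""
    return w + c * x * y

def delta_update(w, c, x, y, d):
    """Delta rule: change the weight in proportion to input times the error (d - y)."""
    return w + c * x * (d - y)

c = 1 / 100               # learning rate: 1 / (number of training entries), per the rule of thumb
w = 0.4                   # current weight on the input arc (illustrative)
x, y, d = 0.9, 0.2, 1.0   # input value, actual output, desired output (illustrative)
print(hebb_update(w, c, x, y), delta_update(w, c, x, y, d))
```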
Backpropagation is a learning technique that adjusts weights in the NN by propagating weight
changes backward from the sink to the source nodes. Backpropagation is the most well-known
form of learning because it is easy to understand and generally applicable.
Figure shows the structure and use of one node, j, in a neural network graph.
The basic node structure is shown in part (a). Here the representative input arc has a weight of
w?j, where ? is used to show that the input to node j is coming from another node, shown here
as ?. Of course, there probably are multiple input arcs to a node. The output weight is similarly
labeled wj?.
During propagation, data values input at the input layer flow through the network, with final
values coming out of the network at the output layer. The propagation technique is shown in part
(b). The activation function fj is applied to all the input values and weights, with output values
resulting.
Weights are changed based on the changes that were made in weights in subsequent arcs. This
backward learning process is called backpropagation and is illustrated in Figure (c). Weight wj? is
modified to become wj? + ∆wj?. A learning rule is applied to this ∆wj? to determine the change at the
next higher level, ∆w?j.
ALGORITHM
The MSE is used to calculate the error. The last step of the algorithm uses gradient descent as
the technique to modify the weights in the graph. The basic idea of gradient descent is to find the
set of weights that minimizes the MSE.
Figure and Algorithm illustrate the concept.
The stated algorithm assumes only one hidden layer. More hidden layers would be handled in the
same manner with the error propagated backward.
Figure shows the structure we use to discuss the gradient descent algorithm.
Here node i is at the output layer and node j is at the hidden layer just before it; yi is the output
of i and yj is the output of j.
The learning function in the gradient descent technique is based on using the following value for
the weight change at the output layer:
∆wij = c · yj · (di - yi) · f'(Si)
Here the weight wij is that on the arc coming into i from j, and Si is the weighted sum of the
inputs to node i. Assuming a sigmoidal activation function in the output layer, f'(Si) = yi (1 - yi),
so the change becomes
∆wij = c · yj · (di - yi) · yi · (1 - yi)
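A minimal sketch of this output-layer update, assuming the sigmoidal activation above; the learning rate and node outputs are illustrative values.

```python
def output_weight_change(c, y_j, y_i, d_i):
    """Gradient descent change for the weight on the arc from hidden node j
    to output node i, assuming a sigmoidal activation at node i so that the
    derivative of the activation is y_i * (1 - y_i)."""
    return c * y_j * (d_i - y_i) * y_i * (1 - y_i)

c = 0.1        # learning rate (illustrative)
y_j = 0.6      # output of hidden node j
y_i = 0.2      # actual output of output node i
d_i = 1.0      # desired output of output node i
delta_w = output_weight_change(c, y_j, y_i, d_i)
print(delta_w)  # positive change: the weight is increased to push y_i toward d_i
```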
Radial Basis Function Networks
A radial function or a radial basis function (RBF) is a class of functions whose value decreases
(or increases) with the distance from a central point.
An RBF has a Gaussian shape, and an RBF network is typically an NN with three layers.
1. The input layer is used simply to input the data.
2. A Gaussian activation function is used at the hidden layer.
3. A linear activation function is used at the output layer.
The objective is to have the hidden nodes learn to respond only to a subset of the input,
namely, that where the Gaussian function is centered. This is usually accomplished via
supervised learning.
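A minimal sketch of such a three-layer RBF network; the Gaussian centers, the common width, and the output weights are illustrative assumptions.

```python
import math

def rbf_output(x, centers, width, out_weights):
    """Three-layer RBF network: the hidden layer applies a Gaussian activation
    centered at each hidden node's center; the output layer is linear."""
    hidden = [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2 * width ** 2))
              for c in centers]
    # linear activation at the output layer: weighted sum of hidden responses
    return [sum(w * h for w, h in zip(row, hidden)) for row in out_weights]

centers = [(0.0, 0.0), (1.0, 1.0)]   # Gaussian centers of the two hidden nodes
width = 0.5                          # common Gaussian width
out_weights = [[1.0, -1.0]]          # weights from hidden nodes to one output node
print(rbf_output((0.9, 1.1), centers, width, out_weights))
```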
Perceptrons
The simplest NN is called a perceptron.
A perceptron is a single neuron with multiple inputs and one output.
The original perceptron proposed the use of a step activation function, but it is more
common to see another type of function such as a sigmoidal function.
A simple perceptron can be used to classify into two classes.
o Using a unipolar activation function, an output of 1 would be used to classify into
one class,
o while an output of 0 would be used to place the tuple in the other class.
Here x1 is shown on the horizontal axis and x2 is shown on the vertical axis. The area of the
plane to the right of the line x2 = 3 - (3/2)x1 represents one class and the rest of the plane
represents the other class.
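A minimal sketch of such a perceptron. The weights 3 and 2 and the threshold 6 are one assumed choice that reproduces the boundary x2 = 3 - (3/2)x1; any positive scaling of the same line would serve equally well.

```python
def perceptron(x1, x2, w1=3.0, w2=2.0, threshold=6.0):
    """Single neuron with a unipolar step activation:
    outputs 1 for points on one side of the line w1*x1 + w2*x2 = threshold,
    and 0 for points on the other side."""
    s = w1 * x1 + w2 * x2               # summation function
    return 1 if s >= threshold else 0   # step activation

print(perceptron(3.0, 1.0))  # right of the line x2 = 3 - (3/2)x1 -> class 1
print(perceptron(0.0, 0.0))  # left of the line -> class 0
```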
RULE-BASED ALGORITHMS
One way to perform classification is to generate if-then rules that cover all cases.
Example
If 90 <= grade, then class = A
If 80 <= grade and grade < 90, then class = B
Definition
A classification rule, r = (a, c), consists of the if or antecedent, a, part and the then or
consequent portion, c. The antecedent contains a predicate that can be evaluated as true or false
against each tuple in the database.
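A small sketch of how rules r = (a, c) might be represented and applied to tuples; the antecedents mirror the grade example above.

```python
# Each rule is a pair (antecedent, consequent); the antecedent is a predicate
# that evaluates to True or False against a tuple.
rules = [
    (lambda t: 90 <= t["grade"], "A"),
    (lambda t: 80 <= t["grade"] < 90, "B"),
]

def classify(tuple_, rules):
    """Return the consequent of the first rule whose antecedent is satisfied."""
    for antecedent, consequent in rules:
        if antecedent(tuple_):
            return consequent
    return None  # no rule covers this tuple

print(classify({"grade": 85}, rules))  # -> "B"
```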
The differences between rules and trees:
• The tree has an implied order in which the splitting is performed. Rules have no order.
• A tree is created based on looking at all classes. When generating rules, only one class must be
examined at a time.
Generating Rules from a DT
The process to generate a rule from a DT is straightforward and is outlined in Algorithm
ALGORITHM
Input: T // Decision tree
Output: R // Rules
Gen algorithm:
// Illustrate simple approach to generating classification rules from a DT
R = ∅
for each path from root to a leaf in T do
a = True
for each non-leaf node do
a = a ∧ (label of node combined with label of incident outgoing arc)
c = label of leaf node
R = R ∪ {r = (a, c)}
This algorithm will generate a rule for each leaf node in the decision tree. All rules with the same
consequent could be combined together by ORing the antecedents of the simpler rules.
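A minimal sketch of this Gen procedure over a simple dictionary-based tree; the tree structure and its labels are illustrative assumptions, not any particular library's representation.

```python
# A decision tree as nested dictionaries: an internal node maps an attribute
# name to {arc_label: subtree}; a leaf is simply a class label.
tree = {"height": {"<1.7m": "short",
                   "1.7m-1.95m": "medium",
                   ">1.95m": "tall"}}

def gen_rules(node, antecedent=()):
    """Walk every root-to-leaf path, ANDing the (attribute, arc label) pairs
    along the path; each leaf yields one rule r = (antecedent, consequent)."""
    if not isinstance(node, dict):            # leaf node: emit a rule
        return [(antecedent, node)]
    rules = []
    (attribute, branches), = node.items()
    for arc_label, subtree in branches.items():
        rules += gen_rules(subtree, antecedent + ((attribute, arc_label),))
    return rules

for a, c in gen_rules(tree):
    print("IF", " AND ".join(f"{attr} {val}" for attr, val in a), "THEN class =", c)
```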
Generating Rules from a Neural Net
The source NN may still be used for classification; the derived rules can be used to verify or
interpret the network. The problem is that the rules do not explicitly exist. They are buried in the
structure of the graph itself.
The basic idea of the RX algorithm is to cluster output values with the associated hidden nodes
and input.
A major problem with rule extraction is the potential size of the extracted rule set. For
example, if you have a node with n inputs each having 5 values, there are 5^n different input
combinations to this one node alone.
To overcome this problem and that of having continuous ranges of output values from nodes, the
output values for both the hidden and output layers are first discretized.
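A small sketch of this discretization step; the three-level binning below is an illustrative choice, not the exact RX procedure.

```python
def discretize(value, boundaries=(0.33, 0.66)):
    """Map a continuous node output in [0, 1] to one of a small number of
    discrete levels, so rules can be stated over a finite set of values."""
    for level, bound in enumerate(boundaries):
        if value < bound:
            return level
    return len(boundaries)

hidden_outputs = [0.05, 0.48, 0.91]             # continuous activations (illustrative)
print([discretize(v) for v in hidden_outputs])  # -> [0, 1, 2]
```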
Generating Rules without a DT or NN
These techniques are sometimes called covering algorithms because they attempt to generate
rules that exactly cover a specific class.
Tree algorithms work in a top-down, divide-and-conquer fashion, but this need not be the case
for covering algorithms. They generate the best rule possible by optimizing the desired
classification probability.
Suppose we want to generate a rule to classify persons as tall. The basic format for the rule is then
If ? then class = tall
The objective for the covering algorithms is to replace the "?" in this statement with predicates
that can be used to obtain the "best" probability of being tall.
The basic idea of the 1R approach is to choose the best single attribute to perform the
classification based on the training data. "Best" is defined here by counting the number of errors.
As with ID3, 1R tends to choose attributes with a large number of values, leading to overfitting.
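A minimal sketch of the error-counting idea described above, in the style of 1R; the tiny training set is made up for illustration.

```python
from collections import Counter, defaultdict

# Training tuples: attribute values plus the known class (illustrative data)
data = [({"gender": "F", "height": "short_bucket"}, "short"),
        ({"gender": "F", "height": "tall_bucket"},  "tall"),
        ({"gender": "M", "height": "tall_bucket"},  "tall"),
        ({"gender": "M", "height": "short_bucket"}, "medium")]

def one_r(data, attributes):
    """For each attribute, predict the majority class per attribute value;
    pick the attribute whose rules make the fewest errors on the training data."""
    best_attr, best_rules, best_errors = None, None, None
    for attr in attributes:
        counts = defaultdict(Counter)
        for tup, cls in data:
            counts[tup[attr]][cls] += 1
        rules = {val: c.most_common(1)[0][0] for val, c in counts.items()}
        errors = sum(cls != rules[tup[attr]] for tup, cls in data)
        if best_errors is None or errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules, best_errors

print(one_r(data, ["gender", "height"]))  # height wins: it makes fewer errors
```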
Another approach to generating rules without first having a DT is called PRISM. PRISM generates
rules for each class by looking at the training data and adding rules that completely describe all
tuples in that class.
COMBINING TECHNIQUES
Given a classification problem, no one classification technique always yields the best results.
Therefore, there have been some proposals that look at combining techniques.
Two basic techniques can be used to accomplish this:
• A synthesis of approaches takes multiple techniques and blends them into a new approach.
Example: linear regression, to predict a future value for an attribute that is then used as input to a
classification NN. In this way the NN is used to predict a future classification value.
• Multiple independent approaches can be applied to a classification problem, each yielding its
own class prediction. This approach has been referred to as combination of multiple classifiers
( CMC).
The values are combined with a weighted linear combination of the individual classifiers' outputs.
Example
Two classifiers exist to classify tuples into two classes. A target tuple, X, needs to be classified.
Using a nearest neighbor approach, the 10 tuples closest to X are identified.
Figure shows the 10 tuples closest to X
Here the weights, wk, can be assigned by a user or learned based on the past accuracy of each
classifier. Another technique is to choose the classifier that has the best accuracy on a database
sample. This is referred to as dynamic classifier selection (DCS).
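A minimal sketch of the CMC weighted linear combination and of DCS selection; the per-class scores, weights, and accuracies below are illustrative values.

```python
def cmc_combine(predictions, weights):
    """Combination of multiple classifiers: weighted linear combination of the
    per-class scores produced by each classifier; return the winning class."""
    n_classes = len(predictions[0])
    combined = [sum(w * p[k] for w, p in zip(weights, predictions))
                for k in range(n_classes)]
    return combined.index(max(combined))

def dcs_select(accuracies):
    """Dynamic classifier selection: use the classifier with the best
    accuracy on a database sample."""
    return accuracies.index(max(accuracies))

# Two classifiers scoring tuple X for classes 0 and 1 (illustrative values)
predictions = [[0.7, 0.3], [0.4, 0.6]]
weights = [0.8, 0.2]                      # e.g., learned from past accuracy
print(cmc_combine(predictions, weights))  # -> class 0
print(dcs_select([0.92, 0.88]))           # -> classifier 0
```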