UNIT 3
MACHINE LEARNING
TREE MODELS
• Feature Tree:
A compact way of representing a number of
conjunctive concepts in the hypothesis space.
• Tree:
1. Internal nodes are features.
2. Edges are labelled with literals.
3. Split: the set of literals at a node.
4. Leaf: labelled with the logical expression formed by the conjunction
of the literals on the path from the root to that leaf.
TREE MODELS
• Generic algorithm
Three functions:
1. Homogeneous(D): returns true if all instances in D belong
to a single class (true/false).
2. Label(D): returns the class label to assign to D (the majority class).
3. BestSplit(D,F): returns the feature from F on which the dataset D
is best split (into two or more subsets).
TREE MODELS
• Divide-and-conquer algorithm:
divides the data into subsets, builds a tree for each of
those, and then combines the subtrees into a
single tree.
• Greedy:
whenever there is a choice (such as choosing the
best split), the best alternative is selected on the basis
of the information then available, and this choice is
never reconsidered.
• Backtracking search algorithm:
can return an optimal tree, at the expense
of increased computation time and memory
requirements.
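A minimal sketch of the generic greedy divide-and-conquer tree learner built from the three functions above; the (instance, class)-pair data representation and the signature of best_split are assumptions made here for illustration, not part of the slides.

from collections import Counter

def homogeneous(D):
    # D is a list of (instance, class) pairs; true if only one class occurs
    return len({c for _, c in D}) <= 1

def majority_label(D):
    # label to put in a leaf: the majority class of D
    return Counter(c for _, c in D).most_common(1)[0][0]

def grow_tree(D, features, best_split):
    # generic divide-and-conquer learner; best_split(D, features) is assumed
    # to return the feature to split on, and instances are dicts
    if homogeneous(D) or not features:
        return ("leaf", majority_label(D))
    f = best_split(D, features)
    children = {}
    for v in {x[f] for x, _ in D}:                 # one child per observed value
        Dv = [(x, c) for x, c in D if x[f] == v]   # divide ...
        children[v] = grow_tree(Dv, [g for g in features if g != f], best_split)  # ... and conquer
    return ("node", f, children)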
DECISION TREES
• Classification task on a dataset D:
1. Homogeneous(D): label the leaf with the single class occurring in D.
2. Non-homogeneous D: label the leaf with the majority class of D.
3. Empty child (Di = Ø after splitting D into D1, D2, ...): label it
with the majority class of the parent D.
DECISION TREES
• The ideal split sends all positives one way and all negatives the other:
D1+ = D+ and D1- = Ø, D2- = D- and D2+ = Ø,
i.e. both children are pure.
• Impurity: depends only on the relative magnitude of n+ and n-,
• so it can be measured through the proportion ṗ = n+ / (n+ + n-), the empirical
probability of the positive class.
• Aim: we need a function of ṗ that returns
0 if ṗ = 0 or ṗ = 1 (a pure set), and
its maximum value at ṗ = ½.
FUNCTIONS
1. MINORITY CLASS (error rate)
2. GINI INDEX (expected error rate)
3. ENTROPY (expected information)
MINORITY CLASS
• min(ṗ, 1-ṗ): the error rate obtained when every instance in the leaf
is labelled with the majority class.
• The minority class is proportional to the number of misclassified examples.
• Example: spam = 40 (majority class), ham = 10 (minority class, misclassified);
ṗ = 40/50 = 0.8, so the error rate is min(0.8, 0.2) = 0.2.
• A pure set has no minority class and hence no errors.
• Written as an impurity function: min(ṗ, 1-ṗ) = ½ - |ṗ - ½|.
GINI INDEX
• The expected error rate when labels are assigned to instances at random,
with probability ṗ for the positive class and 1-ṗ for the negative class.
• Probability of a false negative: ṗ (1-ṗ) (a positive labelled negative).
• Probability of a false positive: (1-ṗ) ṗ (a negative labelled positive).
• Expected error rate: 2 ṗ (1-ṗ).
ENTROPY
• The expected information, in bits.
• Formula: -ṗ log2 ṗ - (1-ṗ) log2 (1-ṗ)
(Figure: the impurity measures for decision trees plotted as functions of ṗ; the entropy and Gini index curves.)
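A short Python sketch of the three binary impurity measures above as functions of the empirical probability ṗ; the function names are chosen here for illustration.

from math import log2

def minority_class(p):
    # error rate of labelling with the majority class
    return min(p, 1 - p)

def gini_index(p):
    # expected error rate of random labelling: p(1-p) + (1-p)p
    return 2 * p * (1 - p)

def entropy(p):
    # expected information in bits; 0*log2(0) is taken to be 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

# all three are 0 for a pure set and maximal at p = 0.5
for p in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(p, minority_class(p), gini_index(p), round(entropy(p), 3))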
Decision Tree
• K > 2 classes: either use a one-vs-rest scheme (each class against the
union of the rest), or generalise the impurity measures directly:
• K-class entropy = - Σ_{i=1..K} p_i log2 p_i
• K-class Gini index = Σ_{i=1..K} p_i (1 - p_i)
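These K-class measures generalise the binary sketch above; a brief illustration, assuming a vector of class proportions that sums to 1:

from math import log2

def k_class_entropy(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

def k_class_gini(ps):
    return sum(p * (1 - p) for p in ps)

print(k_class_entropy([0.5, 0.25, 0.25]))  # 1.5 bits
print(k_class_gini([0.5, 0.25, 0.25]))     # 0.625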
RANKING AND PROBABILITY ESTIMATION
• Grouping classifiers divide the instance space into segments
(for a tree, the segments are its leaves).
• They can be turned into rankers by learning an ordering on those segments.
• Decision trees have access to the local class distribution in each leaf,
which can be used directly to construct the leaf ordering in an optimal way.
• Using the empirical probabilities it is easy to calculate the leaf ordering:
• give the highest rank to the leaves with the highest proportion of positives.
• On the training data this ordering yields a convex ROC curve.
• The empirical probability of a parent is a weighted average of the
empirical probabilities of its children; but this only tells us that
ṗ1 ≤ ṗ ≤ ṗ2 or ṗ2 ≤ ṗ ≤ ṗ1.
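In symbols, with n1 and n2 the numbers of training instances in the two children, the quoted property is:

\[
\dot{p} \;=\; \frac{n_1}{n_1+n_2}\,\dot{p}_1 \;+\; \frac{n_2}{n_1+n_2}\,\dot{p}_2,
\qquad\text{hence}\qquad
\min(\dot{p}_1,\dot{p}_2) \;\le\; \dot{p} \;\le\; \max(\dot{p}_1,\dot{p}_2).
\]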
• Consider a feature tree whose leaves are not yet labelled.
• Question: in how many ways can the leaves be labelled, and how do
these labellings perform?
• Assume we know the number of positives and negatives covered by each leaf.
• With L leaves and C classes there are CL (C to the power L) ways to label the leaves.
• Example: two classes and four leaves give 2^4 = 16 labellings.
• The resulting points in coverage space show a symmetry property:
• complementary labellings such as +-+- and -+-+ give symmetrically
placed points (swapping the two classes mirrors the point).
• The labellings on the upper-left path through the corners of the
coverage curve contain the optimal labelling for any operating condition:
• ----, --+-, +-+-, +-++, ++++
• Ordering rather than labelling the leaves: with L leaves there are
L! possible orderings (permutations).
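A tiny sketch enumerating the 2^4 = 16 labellings of four leaves mentioned above (binary classes, leaf order fixed):

from itertools import product

labellings = [''.join(t) for t in product('+-', repeat=4)]
print(len(labellings))                              # 16 = 2**4
print('+-+-' in labellings, '-+-+' in labellings)   # the complementary pair from the slide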
• A feature tree can be turned into
-- a ranker: order the leaves in descending order of their
empirical probabilities;
-- a probability estimator: predict the empirical probability in each
leaf, possibly smoothed with the Laplace or m-estimate correction;
-- a classifier: given the operating conditions (class and cost
distribution), find the operating point on the ROC curve that fits
those conditions and label the leaves accordingly.
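A small sketch of the first two conversions, assuming each leaf is summarised by its counts of positive and negative training instances; the counts below are invented for illustration:

# hypothetical leaf counts: (positives, negatives) per leaf
leaves = {"L1": (20, 5), "L2": (10, 10), "L3": (3, 12), "L4": (0, 8)}

def empirical(pos, neg):
    return pos / (pos + neg)

def laplace(pos, neg):
    # Laplace correction: one pseudo-count per class
    return (pos + 1) / (pos + neg + 2)

# ranker: leaves in descending order of empirical probability
print(sorted(leaves, key=lambda l: empirical(*leaves[l]), reverse=True))

# probability estimator: smoothed probability per leaf
print({l: round(laplace(*leaves[l]), 3) for l in leaves})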
• Example: suppose the optimal labelling under the given operating
conditions is +-++.
• Then only the second leaf is used to filter out negatives.
• In other words, the right two leaves can be merged into
one: their parent.
• The operation of merging all leaves in a subtree is called
pruning the subtree.
• The advantage of pruning is that we can simplify the
tree without affecting the chosen operating point,
which is sometimes useful if we want to communicate
the tree model to somebody else.
• The disadvantage is that we lose ranking performance.
Sensitivity to Skewed Class Distribution
• Gini index of the parent: 2 (n+ / n)(n- / n).
• Size-weighted Gini index of a child with n1 = n1+ + n1- instances:
(n1 / n) · 2 (n1+ / n1)(n1- / n1).
• Using √Gini as the impurity measure instead, the relative impurity
(weighted child impurity divided by parent impurity) becomes
sqrt( (n1+ · n1-) / (n+ · n-) ),
which is unaffected when the class distribution is skewed (e.g. every
negative count multiplied by the same factor); plain Gini and entropy
are sensitive to such skew.
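A small numerical check of this insensitivity claim; the counts are invented for illustration:

from math import sqrt

def rel_impurity_gini(n1p, n1n, np, nn):
    # (size-weighted child Gini) / (parent Gini)
    n1, n = n1p + n1n, np + nn
    child = (n1 / n) * 2 * (n1p / n1) * (n1n / n1)
    parent = 2 * (np / n) * (nn / n)
    return child / parent

def rel_impurity_sqrt_gini(n1p, n1n, np, nn):
    # the same ratio with sqrt(Gini) as the impurity measure
    return sqrt((n1p * n1n) / (np * nn))

# a child with 30+/10- inside a parent with 50+/50-
print(rel_impurity_gini(30, 10, 50, 50), rel_impurity_sqrt_gini(30, 10, 50, 50))
# multiply every negative count by 10 (skewed class distribution):
# the Gini-based ratio changes, the sqrt(Gini)-based ratio does not
print(rel_impurity_gini(30, 100, 50, 500), rel_impurity_sqrt_gini(30, 100, 50, 500))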
How would you train decision trees for
a dataset?
• Aim for a good ranking estimator.
• Use a distribution-insensitive impurity measure (e.g. √Gini).
• Disable pruning while growing the tree.
• Use the operating condition to pick the operating point on the ROC curve.
• Afterwards, prune only subtrees whose leaves all receive the same label.
Tree Learning as Variance Reduction
• The Gini index 2p(1-p) is the expected error rate when instances
are labelled positive or negative at random.
• Compare tossing a coin with probability p of heads:
the outcome is a Bernoulli variable with variance p(1-p)
(p: probability of the positive outcome occurring,
1-p: probability of it not occurring).
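The variance claim follows from the standard Bernoulli calculation:

\[
\operatorname{Var}(X) \;=\; E[X^2] - E[X]^2 \;=\; p - p^2 \;=\; p(1-p),
\]

which is the Gini index up to the factor 2; this is why splitting on the Gini index can be read as variance reduction, and why the same idea carries over to regression trees below.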
REGRESSION TREE
Regression Tree
Split on Model (A100, B3, E112, M102, T202): target values and group means
• A100 [1051, 1770, 1900], mean ≈ 1574
• B3 [4513], mean = 4513
• E112 [77], mean = 77
• M102 [870], mean = 870
• T202 [99, 270, 625], mean ≈ 331
Calculate the size-weighted variances (each group's variance times its
share of the 9 instances, i.e. the sum of squared deviations divided by 9):
• A100
(523² + 196² + 326²)/9 = (273529 + 38416 + 106276)/9 ≈ 46469
• B3
(4513 - 4513)²/9 = 0
• E112
(77 - 77)²/9 = 0
• M102
(870 - 870)²/9 = 0
• T202
(232² + 61² + 294²)/9 = (53824 + 3721 + 86436)/9 ≈ 15998
• Weighted average variance for Model:
46469 + 0 + 0 + 0 + 15998 ≈ 62467
• Similarly for Condition (excellent, good, fair):
excellent [1770, 4513], mean ≈ 3142
good [270, 870, 1051, 1900], mean ≈ 1023
fair [77, 99, 625], mean = 267
Size-weighted variances:
• excellent
(1372² + 1371²)/9 = (1882384 + 1879641)/9 ≈ 418003
• good
(753² + 153² + 28² + 877²)/9 = (567009 + 23409 + 784 + 769129)/9 ≈ 151148
• fair
(190² + 168² + 358²)/9 = (36100 + 28224 + 128164)/9 ≈ 21388
• Weighted average variance for Condition:
418003 + 151148 + 21388 ≈ 590539
• Similarly for Leslie (yes, no):
yes [625, 870, 1900], mean ≈ 1132
no [77, 99, 270, 1051, 1770, 4513], mean ≈ 1297
Size-weighted variances:
• yes
(507² + 262² + 768²)/9 = (257049 + 68644 + 589824)/9 ≈ 101724
• no
(1220² + 1198² + 1027² + 246² + 473² + 3216²)/9
= (1488400 + 1435204 + 1054729 + 60516 + 223729 + 10342656)/9 ≈ 1622804
• Weighted average variance for Leslie:
101724 + 1622804 ≈ 1724528
Weighted average variances:
1. Model ≈ 62467
2. Condition ≈ 590539
3. Leslie ≈ 1724528
Model gives the lowest weighted variance, so it is chosen for the first split.
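A short sketch reproducing these numbers; the target values and groupings are taken from the slides, and the weighted variance of a split is the size-weighted average of its groups' variances (small differences from the hand calculation come from rounding the group means):

data = {
    "Model":     {"A100": [1051, 1770, 1900], "B3": [4513], "E112": [77],
                  "M102": [870], "T202": [99, 270, 625]},
    "Condition": {"excellent": [1770, 4513], "good": [270, 870, 1051, 1900],
                  "fair": [77, 99, 625]},
    "Leslie":    {"yes": [625, 870, 1900], "no": [77, 99, 270, 1051, 1770, 4513]},
}

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

n = 9
for feature, groups in data.items():
    weighted = sum(len(xs) / n * variance(xs) for xs in groups.values())
    print(feature, round(weighted))
# Model has the lowest weighted variance, so it is chosen for the first split.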
• For the A100 child the candidate splits are:
Condition (excellent, good, fair): [1770] [1051, 1900] [] (the empty group is ignored)
Leslie (yes, no): [1900] [1051, 1770] (calculate the variances to choose between them)
• For the T202 child the candidate splits are:
Condition (excellent, good, fair): [] [270] [99, 625] (the empty group is ignored)
Leslie (yes, no): [625] [99, 270] (calculate the variances to choose between them)
Regression Tree
Clustering Trees
• A regression tree finds instance-space segments in which the
target values are tightly clustered around the mean.
• The variance of a set of target values (or vectors) is the average
squared Euclidean distance to the mean:
• Var(D) = (1/|D|) Σ_{x in D} ||x - μ||²
• A clustering tree applies the same idea to whole feature vectors and
can be learned using
1. a dissimilarity matrix, or
2. Euclidean distance between feature vectors.
• For A100 the values of the three numerical
features (price, reserve, bids) of its three instances are:
(11, 8, 13)
(18, 15, 15)
(19, 19, 1)
• The mean vector is (16, 14, 9.7).
• Per-feature variances:
price: ((16-11)² + (16-18)² + (16-19)²)/3 = (25 + 4 + 9)/3 ≈ 12.7
reserve: ((14-8)² + (14-15)² + (14-19)²)/3 = (36 + 1 + 25)/3 ≈ 20.7
bids: ((9.7-13)² + (9.7-15)² + (9.7-1)²)/3 = (10.9 + 28.1 + 75.7)/3 ≈ 38.2
• The cluster variance (average squared Euclidean distance to the mean)
is their sum: 12.7 + 20.7 + 38.2 ≈ 71.6.
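A minimal sketch of this computation, using the three A100 feature vectors above:

def cluster_variance(vectors):
    # average squared Euclidean distance to the mean vector
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    return sum(sum((v[j] - mean[j]) ** 2 for j in range(d)) for v in vectors) / n

a100 = [(11, 8, 13), (18, 15, 15), (19, 19, 1)]
print(round(cluster_variance(a100), 1))   # about 71.6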
RULE MODELS
• Logical Models:
1. Tree models.
2. Rule models.
• Rule models consist of a collection of implications
or if–then rules.
• The if-part defines a segment, and the then-part
defines the behaviour of the model in that
segment.
• Two approaches:
1. Find a combination of literals (the body of the
rule, which is called a concept) that covers a
sufficiently homogeneous set of examples, and
find a class label to put in the head of the rule.
This gives an ordered sequence of rules: a rule list.
2. First select a class you want to learn, and then find
rule bodies that cover (large subsets of) the
examples of that class.
This gives an unordered collection of rules: a rule set.
Learning Ordered Rule Lists
• Grow the rule body one literal at a time, each time adding the literal
that most improves the homogeneity of the covered examples.
• Difference from decision trees: a tree split creates a child for every
outcome (e.g. true and false, classes C1 and C2) and evaluates impurity
over all of them, whereas a rule only cares about the purity of the one
child it covers.
• Separate-and-conquer: once a sufficiently pure rule is found, the
examples it covers are removed and the next rule is learned on the
remainder, until no examples are left; see the sketch below.
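A minimal sketch of the separate-and-conquer loop, in which (as a simplification) each rule body is a single literal rather than a grown conjunction; the data representation and the toy examples are assumptions for illustration:

from collections import Counter

def best_literal(D):
    # greedily pick the (feature, value) test whose covered subset is purest
    best, best_purity = None, -1.0
    for f in D[0][0]:                                  # instances are dicts
        for v in {x[f] for x, _ in D}:
            covered = [c for x, c in D if x[f] == v]
            purity = Counter(covered).most_common(1)[0][1] / len(covered)
            if purity > best_purity:
                best, best_purity = (f, v), purity
    return best

def learn_rule_list(D):
    # separate-and-conquer: learn a rule, remove covered examples, repeat
    rules = []
    while D:
        f, v = best_literal(D)
        covered = [(x, c) for x, c in D if x[f] == v]
        head = Counter(c for _, c in covered).most_common(1)[0][0]
        rules.append(((f, v), head))                   # "if f = v then predict head"
        D = [(x, c) for x, c in D if x[f] != v]        # separate: drop covered examples
    return rules

D = [({"teeth": "many", "gills": "no"},  "pos"),
     ({"teeth": "many", "gills": "yes"}, "neg"),
     ({"teeth": "few",  "gills": "no"},  "pos"),
     ({"teeth": "few",  "gills": "yes"}, "neg")]
print(learn_rule_list(D))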
Learning Unordered Rule Sets
• An alternative approach to rule learning:
• rules are learned for one class at a time;
• instead of minimising the impurity min(ṗ, 1 - ṗ),
• we maximise ṗ, the empirical probability of the
class being learned.
Descriptive Rule Learning
• Descriptive models can be learned in either a
supervised or an unsupervised way.
• Supervised:
adapting the rule learning algorithms above
leads to subgroup discovery.
• Unsupervised:
frequent item sets and association rule
discovery.
Subgroup Discovery
• A subgroup whose proportion of positives equals that of the overall
population is uninteresting; quality measures score the deviation:
1. Precision-based:
|prec - pos|
2. Average-recall-based:
|avgrec - 0.5|
3. Weighted Relative Accuracy:
WRAcc = pos · neg · (tpr - fpr)
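A small sketch of these three quality measures computed from a subgroup's contingency counts; the counts below are invented for illustration:

def subgroup_quality(tp, fp, Pos, Neg):
    pos, neg = Pos / (Pos + Neg), Neg / (Pos + Neg)
    prec = tp / (tp + fp)                      # precision of the subgroup
    tpr, fpr = tp / Pos, fp / Neg
    avg_rec = (tpr + (1 - fpr)) / 2            # average recall
    wracc = pos * neg * (tpr - fpr)            # weighted relative accuracy
    return abs(prec - pos), abs(avg_rec - 0.5), wracc

# a hypothetical subgroup covering 30 of 50 positives and 10 of 50 negatives
print(subgroup_quality(30, 10, 50, 50))        # approx (0.25, 0.2, 0.1)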
Association Rule Mining