Text Mining and Classification

Karianne Bergen
[email protected]
Institute for Computational and Mathematical Engineering, Stanford University

Machine Learning Short Course | August 11-15, 2014
Text Classification

• Determine a characteristic of a document based on the text:
  – Author identification
  – Sentiment analysis (e.g. positive vs. negative review)
  – Subject or topic category
  – Spam filtering
Text Classification

[figure: example scam email, from http://www.theshedonline.org.au/activities/activity/scam-email-examples]
Document Features

• How do we generate a set of input features from a text document to pass to the machine learning algorithm?
  – Bag of words / term-document matrix
  – N-grams
Bag-of-Words Model

• Representation of text data in terms of frequencies of words from a dictionary
  – The grammar and ordering of words are ignored
  – Just keep the (unordered) list of words that appear and the number of times they appear (see the short R sketch below)
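A minimal sketch of the bag-of-words idea in base R; the example sentence and variable names are illustrative only, not from the slides:

# bag-of-words sketch: unordered word counts for one document
doc <- "one fish two fish red fish blue fish"
words <- unlist(strsplit(tolower(doc), "\\s+"))  # split on whitespace
word.counts <- table(words)                      # word frequencies, order ignored
word.counts
# blue fish  one  red  two
#    1    4    1    1    1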
Bag-of-Words Model

[figure omitted]
Term-Document Matrix

• Term-document matrix useful for working with text data
  – Sparse matrix describing the frequency of words occurring in a collection of documents
  – Rows represent terms/words; columns represent individual documents
  – Entry (i, j) gives the number of occurrences of term i in document j
Term-Document Matrix

• Example
  – Documents:
    1. "one fish two fish"
    2. "red fish blue fish"
    3. "black fish blue fish"
    4. "old fish new fish"
  – Terms: "one", "two", "fish", "red", "blue", "black", "old", "new"
Term-Document Matrix

               Document
Term        1    2    3    4
"one"       1    0    0    0
"two"       1    0    0    0
"fish"      2    2    2    2
"red"       0    1    0    0
"blue"      0    1    1    0
"black"     0    0    1    0
"old"       0    0    0    1
"new"       0    0    0    1

(An R sketch reproducing this matrix follows.)
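A small R sketch that rebuilds the toy term-document matrix with the tm package (the same package used later in these slides); loading the documents from a character vector is an assumption made for illustration:

# toy term-document matrix with tm
library(tm)
docs <- c("one fish two fish", "red fish blue fish",
          "black fish blue fish", "old fish new fish")
corpus <- Corpus(VectorSource(docs))   # one document per string
tdm <- TermDocumentMatrix(corpus)      # rows = terms, columns = documents
inspect(tdm)                           # same counts as above (terms in alphabetical order)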
N-gram

• N-gram: a contiguous sequence of n items (e.g. words or characters)
• Used for language modeling - features retain information related to word ordering
• e.g. "It's kind of fun to do the impossible." - Walt Disney
  – 3-grams: "It's kind of," "kind of fun," "of fun to," "fun to do," "to do the," "do the impossible," "the impossible it's," "impossible it's kind"
  (a short R sketch for extracting 3-grams follows)
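A minimal base-R sketch for extracting word 3-grams; the helper treats the sentence linearly, so it produces only the first six 3-grams listed above (no wrap-around):

# word 3-grams from a sentence (base R)
sentence <- "It's kind of fun to do the impossible."
tokens <- unlist(strsplit(tolower(gsub("[.]", "", sentence)), "\\s+"))
n <- 3
ngrams <- sapply(1:(length(tokens) - n + 1),
                 function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
ngrams
# "it's kind of" "kind of fun" "of fun to" "fun to do" "to do the" "do the impossible"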
Text Mining: NMF

• Unsupervised learning method for dimensionality reduction
• NMF is a type of matrix factorization
  – Original matrix and factors only contain positive or zero values
  – For dimensionality reduction and clustering
  – Non-negativity of factors makes the results easier to interpret than other factorizations
Nonnegative Matrix Factorization

• NMF factors a matrix X into the product of two non-negative matrices:
    X ≈ WH,   W ≥ 0,  H ≥ 0
• W is the "dictionary" matrix, whose columns are "metafeatures"; H is the coefficient matrix
NMF for Text

• X ∈ ℝ^(t×d): term-document matrix
• W ∈ ℝ^(t×k): k columns ("metafeatures"), each representing a collection of terms
• H ∈ ℝ^(k×d): coefficients
• Each document is represented as a positive combination of the k metafeatures
NMF for Text

• Example
  – Documents:
    1. "one fish two fish"
    2. "red fish blue fish"
    3. "old fish new fish"
    4. "some are red and some are blue"
    5. "some are old and some are new"
  – Terms: "one", "two", "fish", "red", "blue", "old", "new", "some", "are", "and"
NMF for Text: X (term-document matrix)

               Document
Term        1    2    3    4    5
"one"       1    0    0    0    0
"two"       1    0    0    0    0
"fish"      2    2    2    0    0
"red"       0    1    0    1    0
"blue"      0    1    0    1    0
"old"       0    0    1    0    1
"new"       0    0    1    0    1
"some"      0    0    0    2    2
"are"       0    0    0    2    2
"and"       0    0    0    1    1
NMF for Text: W (dictionary matrix)

                                  Metafeature
Term      "one"+"two"   "fish"   "red"+"blue"   "old"+"new"   "some"+"are"+0.5·"and"
"one"          1           0           0              0                  0
"two"          1           0           0              0                  0
"fish"         0           1           0              0                  0
"red"          0           0           1              0                  0
"blue"         0           0           1              0                  0
"old"          0           0           0              1                  0
"new"          0           0           0              1                  0
"some"         0           0           0              0                  1
"are"          0           0           0              0                  1
"and"          0           0           0              0                  0.5
NMF for Text: H (coefficient matrix)

                                     Document
Metafeature                       1    2    3    4    5
"one" + "two"                     1    0    0    0    0
"fish"                            2    2    2    0    0
"red" + "blue"                    0    1    0    1    0
"old" + "new"                     0    0    1    0    1
"some" + "are" + 0.5·"and"        0    0    0    2    2

• e.g. "one fish two fish" → "one" "fish" "two" "fish"
    = 1×"one" + 1×"two" + 2×"fish"
  OR
    = 1×("one" + "two") + 2×"fish"
  (a quick R check of this factorization follows)
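A quick R check, not on the original slides, that the toy factors above reproduce X exactly; the matrices are typed in by hand from the tables:

# verify X = W %*% H for the toy example
terms <- c("one","two","fish","red","blue","old","new","some","are","and")
W <- matrix(0, nrow = 10, ncol = 5, dimnames = list(terms, NULL))
W[c("one","two"), 1]  <- 1     # metafeature 1: "one" + "two"
W["fish", 2]          <- 1     # metafeature 2: "fish"
W[c("red","blue"), 3] <- 1     # metafeature 3: "red" + "blue"
W[c("old","new"), 4]  <- 1     # metafeature 4: "old" + "new"
W[c("some","are"), 5] <- 1     # metafeature 5: "some" + "are" + 0.5*"and"
W["and", 5]           <- 0.5
H <- rbind(c(1,0,0,0,0),       # rows = metafeatures, columns = documents
           c(2,2,2,0,0),
           c(0,1,0,1,0),
           c(0,0,1,0,1),
           c(0,0,0,2,2))
W %*% H                        # recovers the term-document matrix X above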
NMF for Text

• Metafeatures in the dictionary matrix W may reveal interesting patterns in the data
  – Positivity of metafeatures helps with interpretability
  – Groupings of words in metafeatures often occur together in the same document
    • e.g. "red" and "blue" or "old" and "new"
NMF for Text

• e.g. text from news articles in the business section
  – 2500 articles, 50 authors
  – 948 terms after pre-processing (stemming, stop word removal, removal of infrequent terms)
  – Apply NMF factorization with k = 25
  – Metafeatures in the dictionary factor W roughly correspond to topics within the text
  – Representation of text: 948 terms → 25 topics
NMF for Text

"Ford Motor Co. Thursday announced sweeping organizational changes and a major
shake-up of its senior management, replacing the head of its global automotive
operations. The moves include combining Ford's four components divisions into a
single organization with 75,000 employees and $14 billion in revenues, and a
consolidation of the automaker's vehicle product development centers to three
from five."

→ { "ford", "motor", "thursday", "announc", "chang", "major", "senior", "manag", "replac", … }
NMF for Text

Metafeature 1        Metafeature 2        Metafeature 3       Metafeature 4
cargo      0.47      internet   0.43      china     0.73      plant      0.47
air        0.47      comput     0.42      beij      0.31      worker     0.35
airline    0.24      corp       0.30      chines    0.30      uaw        0.24
servic     0.18      use        0.29      state     0.21      strike     0.21
kong       0.16      system     0.20      offici    0.20      ford       0.19
hong       0.16      microsoft  0.19      said      0.19      part       0.17
aircraft   0.13      software   0.18      trade     0.14      local      0.15
airport    0.13      inc        0.16      foreign   0.13      auto       0.15
flight     0.12      technolog  0.16      unite     0.11      said       0.14
                     industri   0.16                          motor      0.13
                     network    0.15                          truck      0.13
                     product    0.13                          chrysler   0.13
                     servic     0.13                          work       0.13
                     busi       0.11                          automak    0.13
                                                              union      0.13
                                                              contract   0.11
NMF for Images

[figure omitted]

NMF for Images

[figure omitted]

NMF for Images

[figure: an image approximated (≈) as a sum (+) of non-negative components]
# NMF in R
# install.packages("NMF") # nmf
library(NMF)
# normalize the columns of the data matrix
V <- scale(data, center = FALSE, scale = colSums(data))
k <- 20
res <- nmf(V, k)
W <- basis(res)        # get dictionary matrix W
H <- coef(res)         # get coefficient matrix H
V.hat <- fitted(res)   # get estimate W*H
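A small follow-up sketch, not on the original slides, for listing the top-weighted terms of each metafeature in W (as in the topic tables shown earlier); it assumes the rows of the data matrix V, and hence of W, are named by their terms:

# top-weighted terms per metafeature (illustrative helper)
top.terms <- function(W, n = 10) {
  lapply(seq_len(ncol(W)), function(j) {
    ord <- order(W[, j], decreasing = TRUE)[1:n]
    setNames(round(W[ord, j], 2), rownames(W)[ord])   # named (term, weight) vector
  })
}
top.terms(W, n = 5)   # list with one entry per metafeature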
Text Classification

• Naïve Bayes
  – Simple algorithm based on Bayes rule from statistics
  – Uses the bag-of-words model for documents
  – Has been shown to be very effective for text classification
Naïve Bayes

• NB chooses the most likely class label based on the following assumption about the data:
  – Independent feature (word) model: the presence of any word in a document is unrelated to the presence/absence of other words
• This assumption makes it easier to combine the contributions of features; we don't need to model interactions between words
• Even though this assumption rarely holds, NB still works well in practice
Naïve Bayes

• Compute Prob(Y = j | X) for each class j and choose the class with greatest probability
• Bayesian classifiers:
    Prob(Y | X) = Prob(Y) Prob(X | Y) / Prob(X)
• For Naïve Bayes:
    Ŷ = argmax_Y  Prob(Y) ∏_{j=1..d} Prob(X_j | Y)
  – Prob(Y), Prob(X_j | Y) estimated using the training data
  (a small worked example in R follows)
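A minimal worked example, not from the slides, of the Naïve Bayes decision rule on a toy bag-of-words problem; the word counts, class priors, and add-one smoothing are made up for illustration:

# toy Naive Bayes by hand: rows = classes, columns = words
counts <- rbind(spam = c(free = 30, meeting = 5,  offer = 25),
                ham  = c(free = 5,  meeting = 40, offer = 10))
prior  <- c(spam = 0.5, ham = 0.5)           # Prob(Y)
probs  <- (counts + 1) / rowSums(counts + 1) # Prob(word | class), Laplace smoothing
doc    <- c("free", "offer")                 # words in a new document
scores <- log(prior) + sapply(rownames(counts),
                              function(cl) sum(log(probs[cl, doc])))
names(which.max(scores))                     # predicted class ("spam" here)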
Naïve Bayes

• Advantages:
  – Does not require a large training set to obtain good performance, especially in text applications
  – Independence assumption leads to faster computations
  – Is not sensitive to irrelevant features
• Disadvantages:
  – Independence of features assumption
  – Good classifier, but poor probability estimates
Author Identification

• Collection of poems - William Shakespeare or Robert Frost?
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
Then took the other, as just as fair,
And having perhaps the better claim…
Shall I compare thee to a summer's day?
Thou art more lovely and more temperate.
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date.
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature's changing course, untrimmed;…
Author Identification

install.packages("tm") # text mining
library(tm)            # loads library
# shakespeare
s.dir = "shakespeare"
s.docs <- Corpus(DirSource(directory=s.dir, encoding="UTF-8"))
# frost
f.dir = "frost"
f.docs <- Corpus(DirSource(directory=f.dir, encoding="UTF-8"))
cleanCorpus <- function(corpus){
  # apply stemming
  corpus <- tm_map(corpus, stemDocument, lazy=TRUE)
  # remove punctuation
  corpus.tmp <- tm_map(corpus, removePunctuation)
  # remove white spaces
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  # remove stop words
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("en"))
  return(corpus.tmp)
}
d.docs <- c(s.docs, f.docs)     # combine data sets
d.cldocs <- cleanCorpus(d.docs) # preprocessing
# form the document-term matrix
d.tdm <- DocumentTermMatrix(d.cldocs)
# remove infrequent terms
d.tdm <- removeSparseTerms(d.tdm, 0.97)
> dim(d.tdm) # [ #docs, #terms ]
[1] 264 518
> inspect(d.tdm) # inspect entries in the document-term matrix
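The next slides query per-author matrices s.tdm and f.tdm, which are not constructed in the code shown above; a minimal sketch of how they might be built, assuming the same preprocessing is applied to each author's corpus separately:

# per-author document-term matrices (assumed construction, not from the slides)
s.tdm <- removeSparseTerms(DocumentTermMatrix(cleanCorpus(s.docs)), 0.97)
f.tdm <- removeSparseTerms(DocumentTermMatrix(cleanCorpus(f.docs)), 0.97)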
# exploring the data
# terms appearing at least 55 times in shakespeare's poems
> findFreqTerms(s.tdm, 55)
[1] "and" "but" "doth" "eye" "for" "heart" "love"
    "mine" "sweet" "that" "the" "thee" "thi" "thou"
    "time" "yet"
# terms appearing at least 55 times in frost's poems
> findFreqTerms(f.tdm, 55)
[1] "and" "back" "but" "come" "know" "like" "look"
    "make" "one" "say" "see" "that" "the" "they" "way"
    "what" "with" "you"
# exploring the data
# identify associations between terms - shakespeare
> findAssocs(s.tdm, "winter", 0.2)
winter
summer 0.50
age 0.40
youth 0.34
like 0.24
old 0.23
beauti 0.21
seen 0.21
# exploring the data
# identify associations between terms - frost
> findAssocs(f.tdm, "winter", 0.5)
winter
climb 0.66
town 0.62
toward 0.57
side 0.55
black 0.53
mountain 0.52
# assign class labels to each document,
# based on the document author
class.names = c('shakespeare','frost')
d.class = c(rep(class.names[1], nrow(s.tdm)),
            rep(class.names[2], nrow(f.tdm)))
d.class = as.factor(d.class)
> levels(d.class)
[1] "frost"       "shakespeare"
# separate data into training and test sets
set.seed(123)      # set random seed
train_frac = 0.6   # fraction of data for training
train_idx = sample.int(nrow(d.tdm),
                       size = ceiling(nrow(d.tdm) * train_frac),
                       replace = FALSE)
train_idx <- sort(train_idx)
test_idx <- setdiff(1:nrow(d.tdm), train_idx)
d.tdm.train <- d.tdm[train_idx,]
d.tdm.test <- d.tdm[test_idx,]
d.class.train <- d.class[train_idx]
d.class.test <- d.class[test_idx]
# separate data into training and test sets
> d.tdm.train
<<DocumentTermMatrix (documents: 159, terms: 518)>>
Non-/sparse entries : 6167/76195
Sparsity : 93%
Maximal term length : 9
Weighting : term frequency (tf)
> d.tdm.test
<<DocumentTermMatrix (documents: 105, terms: 518)>>
Non-/sparse entries : 4578/49812
Sparsity : 92%
Maximal term length : 9
Weighting : term frequency (tf)
# CART
install.packages("rpart") # install cart package
library(rpart) # load library
d.frame.train <- data.frame(as.matrix(d.tdm.train));
d.frame.train$class <- as.factor(d.class.train)
treefit <- rpart(class ~., data = d.frame.train)
> summary(treefit)
Variables actually used in tree construction:
[1] doth eyes green grow let thee which
Decision Tree Result

plot(treefit, uniform=TRUE)
text(treefit, use.n=T)

[figure: plotted classification tree omitted]
• William Shakespeare or Robert Frost?
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
Then took the other, as just as fair,
And having perhaps the better claim…
Shall I compare thee to a summer's day?
Thou art more lovely and more temperate.
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date.
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimmed;
And every fair from fair sometime declines,
By chance, or nature's changing course, untrimmed;…
# CART
Node number 1: 159 observations, complexity param=0.3947368
predicted class=shakespeare expected loss=0.4779874 P(node) =1
class counts: 76 83
probabilities: 0.478 0.522
left son=2 (120 obs) right son=3 (39 obs)
Primary splits:
thee < 0.0007022472 to the left, improve=21.14, (0 missing)
thi < 0.01323529 to the left, improve=21.14, (0 missing)
thou < 0.003511236 to the left, improve=19.58, (0 missing)
doth < 0.0007022472 to the left, improve=16.21, (0 missing)
love < 0.01906318 to the left, improve=14.89, (0 missing)
Surrogate splits:
thou < 0.003511236 to the left, agree=0.906, (0 split)
thi < 0.0007022472 to the left, agree=0.899, (0 split)
art < 0.005088523 to the left, agree=0.836, (0 split)
thine < 0.0007022472 to the left, agree=0.824,(0 split)
hast < 0.009433962 to the left, agree=0.805, (0 split)
# CART
# test-set data frame (assumed; built analogously to d.frame.train)
d.frame.test <- data.frame(as.matrix(d.tdm.test))
predclass <- predict(treefit, d.frame.test)
colNames = colnames(predclass)
d.class.pred <- as.factor(colNames[max.col(predclass)])
tree.table <- table(d.class.pred, d.class.test,
                    dnn = list('predicted','actual'))
> tree.table
             actual
predicted     frost shakespeare
  frost          55          12
  shakespeare     1          37
# CART
errorRate <- function(table){
  TP = table[1,1]  # true positives
  TN = table[2,2]  # true negatives
  FP = table[1,2]  # false positives
  FN = table[2,1]  # false negatives
  error_rate = (FP + FN)/(TP + TN + FP + FN)
  return(error_rate)
}
> errorRate(tree.table)
[1] 0.1238095
COME unto these yellow sands,
And then take hands:
Court'sied when you have, and kiss'd,--
The wild waves whist,--
Foot it featly here and there;
And, sweet sprites, the burthen bear.
Hark, hark!
Bow, wow,
The watch-dogs bark:
Bow, wow.
Hark, hark! I hear
The strain of strutting chanticleer
Cry, Cock-a-diddle-dow!

How countlessly they congregate
O'er our tumultuous snow,
Which flows in shapes as tall as trees
When wintry winds do blow!--
As if with keenness for our fate,
Our faltering few steps on
To white rest, and a place of rest
Invisible at dawn,--
And yet with neither love nor hate,
Those stars like some snow-white
Minerva's snow-white marble eyes
Without the gift of sight.
COME unto these yellow sands,
And then take hands:
Court'sied when you have, and kiss'd,--
The wild waves whist,--
Foot it featly here and there;
And, sweet sprites, the burthen bear.
Hark, hark!
Bow, wow,
The watch-dogs bark:
Bow, wow.
Hark, hark! I hear
The strain of strutting chanticleer
Cry, Cock-a-diddle-dow!

True Author: Shakespeare
Predicted: Frost

How countlessly they congregate
O'er our tumultuous snow,
Which flows in shapes as tall as trees
When wintry winds do blow!--
As if with keenness for our fate,
Our faltering few steps on
To white rest, and a place of rest
Invisible at dawn,--
And yet with neither love nor hate,
Those stars like some snow-white
Minerva's snow-white marble eyes
Without the gift of sight.

True Author: Frost
Predicted: Shakespeare
# KNN
library(class)
# convert the sparse document-term matrices to dense matrices for knn()
knn_res <- knn(as.matrix(d.tdm.train), as.matrix(d.tdm.test),
               d.class.train, k = 5, prob=TRUE)
knn.table <- table(knn_res, d.class.test,
                   dnn = list('predicted','actual'))
> knn.table
             actual
predicted     frost shakespeare
  frost          56          33
  shakespeare     0          16
> errorRate(knn.table)
[1] 0.3142857
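A short follow-up sketch, not from the slides, comparing test error across a few choices of k; the variable names follow the earlier code, the loop itself is an assumption:

# compare kNN test error for several values of k (illustrative)
for (k in c(1, 3, 5, 7, 9)) {
  pred <- knn(as.matrix(d.tdm.train), as.matrix(d.tdm.test),
              d.class.train, k = k)
  err <- mean(pred != d.class.test)   # misclassification rate on the test set
  cat("k =", k, " test error =", round(err, 3), "\n")
}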
# naive bayes
library(e1071)  # provides naiveBayes()
nb_classifier <- naiveBayes(as.matrix(d.tdm.train),
                            d.class.train, laplace = 1)
res <- predict(nb_classifier, as.matrix(d.tdm.test),
               type = "raw", threshold = 0.5)
> res
               frost   shakespeare
  [1,] 2.265614e-244  1.000000e+00
  [2,] 2.285289e-165  1.000000e+00
  [3,]  5.696532e-67  1.000000e+00
  …
[104,]  1.000000e+00  0.000000e+00
[105,]  1.000000e+00  0.000000e+00
# naive bayes
> nb_classifier$apriori # breakdown of training data
d.class.train
frost shakespeare
77 82
# errorRate() as defined on the earlier CART slide;
# res.table is not constructed on the original slides - an assumed construction:
nb.class <- predict(nb_classifier, as.matrix(d.tdm.test))  # predicted class labels
res.table <- table(nb.class, d.class.test,
                   dnn = list('predicted','actual'))
> errorRate(res.table)
[1] 0.1619048
NMF for Text

• Metafeature 1:
  'cargo' 0.4711, 'air' 0.4696, 'airlin' 0.2349, 'servic' 0.1772, 'kong' 0.1648,
  'hong' 0.1583, 'aircraft' 0.1328, 'airport' 0.1271, 'flight' 0.1245

• Metafeature 2:
  'internet' 0.4285, 'comput' 0.4165, 'corp' 0.2990, 'use' 0.2885, 'system' 0.1958,
  'microsoft' 0.1883, 'softwar' 0.1776, 'inc' 0.1630, 'technolog' 0.1618,
  'industri' 0.1565, 'network' 0.1519, 'product' 0.1347, 'servic' 0.1320, 'busi' 0.1146

• Metafeature 3:
  'china' 0.7297, 'beij' 0.3059, 'chines' 0.3034, 'state' 0.2089, 'offici' 0.2038,
  'said' 0.1884, 'trade' 0.1400, 'foreign' 0.1337, 'unite' 0.1147

• Metafeature 4:
  'plant' 0.4729, 'worker' 0.3485, 'uaw' 0.2438, 'strike' 0.2141, 'ford' 0.1877,
  'part' 0.1692, 'local' 0.1498, 'auto' 0.1452, 'said' 0.1382, 'motor' 0.1310,
  'truck' 0.1305, 'chrysler' 0.1291, 'work' 0.1281, 'automak' 0.1264, 'union' 0.1261,
  'contract' 0.1130, 'agreement' 0.1044, 'three' 0.1040, 'mich' 0.1023
# CART
Node number 1: 159 observations, complexity param=0.3947368
predicted class=shakespeare expected loss=0.4779874 P(node) =1
class counts: 76 83
probabilities: 0.478 0.522
left son=2 (120 obs) right son=3 (39 obs)
Primary splits:
thee < 0.5 to the left, improve=21.14719, (0 missing)
thi < 0.5 to the left, improve=20.35459, (0 missing)
thou < 0.5 to the left, improve=19.57953, (0 missing)
doth < 0.5 to the left, improve=16.20745, (0 missing)
tree < 0.5 to the right, improve=13.91526, (0 missing)
Surrogate splits:
thou < 0.5 to the left, agree=0.906, adj=0.615, (0 split)
thi < 0.5 to the left, agree=0.899, adj=0.590, (0 split)
art < 0.5 to the left, agree=0.830, adj=0.308, (0 split)
thine < 0.5 to the left, agree=0.824, adj=0.282, (0 split)
hast < 0.5 to the left, agree=0.805, adj=0.205, (0 split)
Sample R Code

> Auto=read.table("Auto.data")
> fix(Auto)
> dim(Auto)
[1] 392 9
> names(Auto)
[1] "mpg"          "cylinders"    "displacement" "horsepower"
[5] "weight"       "acceleration" "year"         "origin"
[9] "name"