Machine learning methods for data security
Author: József Hegedűs
Supervisor:
Prof. Pekka Orponen
Instructor:
Doc. Amaury Lendasse
2012
Figure: number of (unique) hashes which occur in exactly a given number of files (not necessarily the same files), restricted to clean samples only; x-axis: number of files, y-axis: number of hashes.
Figure: true positive rate vs. false positive rate (ROC curves) for the monthly sample sets 2009.01 to 2009.11.
Methodology for Behavioral-based Malware Analysis and Detection
using Random Projections and K-Nearest Neighbors Classifiers
Jozsef Hegedus, Yoan Miche, Alexander Ilin and Amaury Lendasse
Department of Information and Computer Science,
Aalto University School of Science,
FI-00076 Aalto, Finland
Abstract—In this paper, a two-stage methodology to analyze and detect behavioral-based malware is presented. In the first stage, a random projection reduces the dimensionality of the problem and simultaneously cuts the computational time of the classification task by several orders of magnitude. In the second stage, a modified K-Nearest Neighbors classifier is used with VirusTotal labeling of the file samples. This methodology is applied to a large number of file samples provided by F-Secure Corporation, for which a dynamic feature has been extracted during DeepGuard sandbox execution. As a result, the files classified as false negatives are used to detect possible malware that was not detected in the first place by VirusTotal. The reduced number of selected false negatives allows manual inspection by a human expert.
I. Introduction
Malware detection has been the subject of a large number of studies (see [1], [2], [3] and [4], [5], [6], [7], [8]); for example, the work of Bailey [9] using a signature-based malware detection approach has shown that recent malware types require additional information in order to obtain good detection.
In this paper, an approach based on the extraction of dynamic features during sandbox execution is used, as suggested in [7]. In order to measure similarities between executable files, the Jaccard index is used to measure the similarities between the sets of hash values (encoding the dynamic feature values obtained from the sandbox). The hash values are transformed into a large number of binary values, from which the Jaccard index can be computed (see [10] for the original work in French or [11] in English). Unfortunately, the dimensionality of such a variable space does not allow the use of traditional classifiers in a reasonable computational time.
A two-stage methodology is proposed to circumvent this dimensionality problem. In the first stage, a random projection reduces the dimensionality of the problem and simultaneously cuts the computational time by several orders of magnitude. In the second stage, a modified K-Nearest Neighbors classifier is used with VirusTotal [12] labeling of the file samples. This two-stage methodology is presented in section III. The practical implementation of the methodology and the results are discussed in section IV. The different parameters (the random projection dimension and the number of nearest neighbors) are also analysed in this section.
As a global result, the methodology enables the identification of the false negatives from the classification. Such samples can then be used to detect possible malware that was not detected in the first place by the VirusTotal labeling. Thanks to the methodology, the reduced number of identified false negatives allows for a manual inspection by a human expert. Indeed, without this pruning of possibly malicious samples by the presented methodology, a manual inspection would not be possible, since reliable experts are scarce and their availability is highly limited.

Using the proposed methodology and the know-how of one F-Secure Corporation expert, it has been possible to extract 24 malware candidates out of 2441 original candidates, of which 25% are surely malicious and 50%, which are probably malicious, have to be further investigated in order to obtain a decisive classification.
In section II, the data gathering and sample labeling are described. Section III presents the two-stage methodology, while section IV shows the practical implementation, the results and the analysis of the results.
II. Behavioral Data Gathering and Sample Labeling
The data set used in this paper is focused on behavior-based malware analysis and detection. The former approach of signature-based malware detection can no longer be considered sufficient for reliable detection [9], [7]. Be it because of the development of polymorphic and metamorphic malware, or the approach of flash worms which only do some reconnaissance on the machines/network they scan for future deployment of targeted attacks, the need for execution-level identification is important.
A. Sandboxing and Extracting Behavioral Features

In this spirit, a currently popular approach [7], [6] is to sandbox the execution of the malware and analyze behavioral data extracted during the execution.

It has recently been demonstrated in [8] that the use of public sandbox submission systems might reveal network information regarding the sandbox machine identity. Through submission of a decoy sample, an attacker can blacklist the hosts on which the samples are sandboxed and have the malware circumvent the sandbox execution and hence detection.

The Norman sandbox development kit [13], released in 2009, enables security companies to gather the behavioral data obtained during sandboxed execution and analyze that data with a custom engine. This avoids the pitfall of a publicly available sandbox machine mentioned above.

The results in this paper were obtained on the data of 32683 samples collected by F-Secure Corporation. The sample data were produced by F-Secure by running the samples through their sandbox engine [14], [15], [16], which resulted in large numbers of feature-value pairs extracted for each sample. Individual features may have a significant number of distinct values, and the values come in the form of hashes. The data cannot be considered complete, as the sandbox, for instance, may not be able to run some of the samples correctly or may miss relevant execution paths.

The samples were labeled using an online sample analysis tool explained in the next section.

B. Obtaining the Sample Labeling

The VirusTotal [12] online analysis tool provides a simple interface for sample submission, returning a list of up to 43 (depending on the sample nature: executable, archive...) mainstream anti-virus software detection results. Among the most widely used and known are F-Prot, F-Secure, ClamAV, Antivir, AVG, BitDefender, eSafe, Avast, McAfee, NOD32, Norman, Panda, Symantec, TrendMicro, VirusBuster... See the VirusTotal web site for the full list of engines used [12].

The result of the submission of a sample file is the number of engines which detected the sample as malware. Figure 1 is a histogram of the detection levels for the set of 32683 samples used in this paper. As can be seen, a large proportion of the set is detected by at least one engine as malware. Fewer than 2500 samples are not detected by any engine.

Figure 1. Histogram made using 32683 executable samples and querying from www.virustotal.com how many anti-virus engines raise a flag for each sample. Thus for each sample k a number m_k is obtained. For a given value x on the x-axis, the y-axis shows for how many samples k it is true that m_k = x.

In order to make the problem a binary classification one (i.e. identifying whether a sample should be considered malware or clean), an a priori and arbitrary threshold has been set on the number of engines detecting a sample as malware. It is considered that, for a sample i, if the number m_i of engines identifying the sample as malware is such that 0 < m_i < 11, then the sample is discarded. The disadvantage is that these samples are not considered in the whole methodology and therefore not classified. Nevertheless, they also have no influence on the rest of the data set and the final classification results.
This is equivalent to setting a certainty threshold on the sample analysis, above which a sample can be considered as indeed malware (and no longer a set of false positives from m_i different engines). Therefore, samples with a number m_i of detecting engines strictly above 10 are kept and considered as malware (with a relatively high probability), and samples with 0 detecting engines are kept and considered as unpredictable (and possibly clean).

Figure 2 illustrates the pruned set of samples, where only samples for which m_i = 0 or m_i > 10 are kept, which amounts to 21053 (out of the original set of 32683): 18612 considered as malware, and 2441 as possibly clean.

It is clear that flagging the 2441 samples for which m_i = 0 as possibly clean is likely to hide a certain amount of false negatives (VirusTotal clearly states that m_i = 0 should in no way be taken to mean clean). The meta-goal of this paper is to identify such samples, which are potential false negatives, using a methodology based on the Jaccard similarity measure [11], [10] and K-Nearest Neighbors classifiers.
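The pruning rule described above (discard samples with 0 < m_i < 11, keep m_i = 0 as possibly clean and m_i > 10 as malware) can be sketched as a simple filter. This is an illustrative sketch only: the sample names and the `detections` dictionary are hypothetical stand-ins for the VirusTotal query results.

```python
def prune(detections):
    """Keep only samples with m == 0 (possibly clean) or m > 10 (malware).

    `detections` maps sample id -> number m of detecting engines;
    samples with 0 < m < 11 are discarded entirely.
    """
    malware = [s for s, m in detections.items() if m > 10]
    possibly_clean = [s for s, m in detections.items() if m == 0]
    return malware, possibly_clean

# Hypothetical detection counts for four samples.
counts = {"a.exe": 0, "b.exe": 5, "c.exe": 25, "d.exe": 11}
print(prune(counts))  # (['c.exe', 'd.exe'], ['a.exe'])
```

Note that "strictly above 10" keeps m = 11 ("d.exe") as malware, while "b.exe" (m = 5) falls in the discarded band.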
III. Methodology
The overall process can be summarized by Figure 3,
with the dynamic feature extraction described in the
previous section, followed by the actual methodology
to identify potential false negatives, using a Random
Projection approach and K-Nearest Neighbors classifiers
(described in detail in sections III-C and III-B).
Figure 2. The m_k distribution for the samples used for this histogram is identical to Figure 1, with the important difference that samples such that 0 < m_i < 11 are discarded. Here 2441 samples are depicted that can be considered as clean (m_i = 0), and 18612 samples that can be considered as malicious (m_i > 10).

Figure 3. Global schematic of the methodology: a sample is run through the sandbox to obtain a set of dynamic features; the random projection approach then reduces the dimensionality of the problem while retaining most of the information conveyed by the original feature; finally, a K-Nearest Neighbors classifier in the random projection space gives a prediction on the studied sample being malware or not.

A. Measuring Similarity between Executables

In this section, an approach for measuring similarities between executables is detailed. Let A^i denote the set of hash values (produced by the sandbox) for file i. Then, the Jaccard similarity J_Jaccard^{i,i′} between two executables i, i′ is calculated as

    J_Jaccard^{i,i′} = |A^i ∩ A^{i′}| / |A^i ∪ A^{i′}|.    (1)

Similarly, the cosine similarity J_cosine^{i,i′} is given by

    J_cosine^{i,i′} = |A^i ∩ A^{i′}| / sqrt(|A^i| |A^{i′}|).    (2)

Note that the cosine similarity can be expressed as a scalar product. Denote by

    A = ∪_{i=1}^{N} A^i = {a_1, a_2, ..., a_D},    (3)

where N is the total number of samples and D is the total number of unique hashes seen in all samples. Then, from an ordering of the set A, N binary (0/1-valued) vectors B^i can be constructed, each of D dimensions, such that

    |A^i ∩ A^{i′}| = ⟨B^i, B^{i′}⟩    (4)

and |A^i| = ‖B^i‖². Here ‖·‖ denotes the vector norm and ⟨·, ·⟩ the scalar product. Since B^i is a binary vector (with coordinates 0 and 1 only), ‖B^i‖² is the number of coordinates in B^i that are equal to 1. So, the normalized scalar product of B^i and B^{i′} gives the cosine similarity:

    J_cosine^{i,i′} = ⟨B^i, B^{i′}⟩ / (‖B^i‖ ‖B^{i′}‖).    (5)

Using the relationship between the Euclidean distance

    D_euclidean = ‖B^i − B^{i′}‖    (6)

and the cosine similarity in the case of ‖B^i‖ = 1 and ‖B^{i′}‖ = 1, it appears that

    J_cosine = (2 − D_euclidean²) / 2.    (7)

From Equation 7 it appears that a classification or clustering based either on the cosine similarity or on the Euclidean distance will yield the same result if the norm of the feature vectors is unity.

B. K-Nearest Neighbor Classification

In this section, a standard method (K-NN, see for example [17], [18], [19], [20]) is described; it can be used to predict whether an unknown executable is malicious or benign. The essential assumption of the method is that malicious (resp. clean) executables are surrounded by malicious (resp. clean) executables in the D-dimensional Euclidean space spanned by the normalized vectors

    B^i / ‖B^i‖,    (8)

with B^i the binary vectors defined in the previous section. This means that the more hashes two samples have in common, the closer they are in this space (assuming that the number of hashes in the two samples does not change).

Let us denote the set of k nearest neighbors of sample i by N_k^i. The classification is based on the data provided by VirusTotal, that is, on how many anti-virus engines have considered a given executable as malicious. Let us denote this number by m_i for sample i. In the Results section it is examined how well the m_{i′} of the neighboring samples N_k^i can actually predict whether the sample i in question is malicious or clean.

It is important to mention that, to predict whether a sample i is malicious or not, only neighboring samples are used and not the sample itself. This corresponds to a Leave-One-Out [21], [22], [23], [24] (LOO) classification rate when it comes to assessing the accuracy of the K-NN classifier in the Results section. In [21], [22], it is shown that the Leave-One-Out estimate approximates well the generalization performance of a classifier if the number of samples is large enough, which is the case in these experiments.

As the dimensionality of B^i is too large, random projections are used in order to reduce this dimensionality and therefore reduce the needed computational time and memory by several orders of magnitude. Random projections are explained in the following section.
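The set-based similarity measures above can be sketched directly on Python sets of hash values. The hash strings below are illustrative placeholders, not actual sandbox output.

```python
import math

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two hash sets."""
    return len(a & b) / len(a | b)

def cosine(a, b):
    """Cosine similarity |A ∩ B| / sqrt(|A| |B|) between two hash sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))

# Two hypothetical files sharing 2 of their 4 hashes each.
A_i = {"h1", "h2", "h3", "h4"}
A_j = {"h3", "h4", "h5", "h6"}

print(jaccard(A_i, A_j))  # 2/6 ≈ 0.333
print(cosine(A_i, A_j))   # 2/4 = 0.5
```

The example also shows the general relation noted later in the paper: for sets of equal size, the cosine similarity is at least the Jaccard similarity, and the two coincide only when the sets are identical.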
C. Random Projections

As mentioned earlier, the cosine similarity is calculated as

    J_cosine^{i,i′} = ⟨B^i, B^{i′}⟩ / (‖B^i‖ ‖B^{i′}‖).    (9)

However, for practical purposes storing the vector B^i is inconvenient, as it requires too much memory (even if stored as a sparse vector). The reason for this is that D, the dimensionality of B^i, is in the range of a few millions. In order to alleviate this memory (and the related time) complexity, random projections are used. For the matter of projecting to a lower-dimensional space, Johnson and Lindenstrauss [25] have shown that for a set of N points in d-dimensional space (using a Euclidean norm), there exists a linear transformation of the data toward a d_f-dimensional space, with d_f = O(ε⁻² log(N)), which preserves the distances (and hopefully the topology of the data) to a 1 ± ε factor. Achlioptas [26] has extended this result and proposed a simpler projection matrix that preserves the distances to the same factor as the Johnson-Lindenstrauss theorem mentions, at the expense of a probability on the distance conservation. For theory and other applications of random projections in machine learning and classification, see for example [27], [28], [29], [30], [31].

To describe the random projection approach, let m ∈ A^i, and associate with each hash m a d-dimensional random vector

    X_m = [X_{m,1}, X_{m,2}, ..., X_{m,d}],  X_m ~ N(0, I),    (10)

such that X_m is independent of X_{m′} if m ≠ m′, while the same hash m receives the same vector X_m regardless of which file it occurs in. N(0, I) represents a d-dimensional standard normal distribution for which the covariance matrix is the identity matrix I.

Then, for each file i the corresponding random projection is the d-dimensional random vector Y^i defined as

    Y^i = (1 / sqrt(d |A^i|)) Σ_{m ∈ A^i} X_m.    (11)

The scalar product of the random vectors gives the similarity J, which is a scalar-valued random variable:

    J^{i,i′} = ⟨Y^i, Y^{i′}⟩.    (12)

Using the definition of Y^i, one can see that Pr(J^{i,i} = 1) = 1. Also, if files i and i′ do not have any hashes in common, i.e. A^i ∩ A^{i′} = ∅, then E(J^{i,i′}) = 0.

As an illustrative example, let us calculate the expected similarity E(J^{i,i′}) by assuming that |A^i| = |A^{i′}| = l and |A^i ∩ A^{i′}| = k. Note that the cosine similarity between i and i′ in this case is k/l. Due to independence,

    E(⟨X_m, X_{m′}⟩) = 0,  m ≠ m′.    (13)

On the other hand, the following scalar product (in the case of matching hashes, m = m′) has the chi-square distribution

    ⟨X_m, X_m⟩ ~ χ²(d),    (14)

where χ²(d) denotes the chi-square distribution with d degrees of freedom, whose expectation value is d. Since only the ⟨X_m, X_m⟩ terms contribute to E(J^{i,i′}), it can be deduced that

    E(J^{i,i′}) = k/l,    (15)

which agrees with the cosine similarity in this case. Note that in general, if |A^i| ≠ |A^{i′}|, then E(J^{i,i′}) ≠ J_Jaccard, but still E(J^{i,i′}) = J_cosine. Therefore, the Jaccard index is approximated using the cosine similarity approach defined previously.
IV. Results
In this section, the Euclidean distance is used in the d-dimensional space spanned by the random projected representations Y^i of the samples. The Y^i are normalized to unity; as noted earlier (Equation 7), the use of the Euclidean distance instead of the cosine similarity then does not change the results presented in this section.
A. Accuracy of K-NN Classifier

An illustration of the prediction accuracy of the K-NN method (see section III-B) is shown in Figure 4, and described in detail in the following.

Let N_10^i be the set of the 10 nearest samples to sample i; then the prediction of the K-NN method for m_i is the mean m̂_i of the values {m_{i′} : i′ ∈ N_10^i}, expressed as

    m̂_i = (1/|N_10^i|) Σ_{i′ ∈ N_10^i} m_{i′}.    (16)

For a given value x on the x-axis, the height of the bar on the y-axis shows for how many samples m̂_i = x, i.e. y(x) = |{i : m̂_i = x}|.

Figure 4. Illustration of the prediction accuracy of the K-NN method: histogram of the mean number of detecting engines of the k = 10 nearest neighbors.

The question is how well the number of detecting engines m_i given by VirusTotal compares with the predicted values m̂_i. In order to answer that question, the samples are divided into two categories: category 1 as supposedly clean (i.e. m_i = 0) and category 2 as supposedly malicious (i.e. m_i > 10). They are shown in Figures 4 and 5. Assuming that m̂_i = 0 means that sample i is predicted to be clean and that m̂_i > 10 means that sample i is predicted to be malicious, there would be a considerable amount of false positives. The number of false positives can be reduced by introducing a third class into the K-NN classifier: unpredictable. The next section details the results obtained using this additional third class and a modified K-NN.

B. Accuracy of Modified K-NN Classifier

Figure 5 shows the prediction accuracy of the modified K-NN classifier. Now the K-NN classifier has 3 classes: predicted to be clean, predicted to be malicious, and unpredictable.

A sample i is classified as clean if m̂_i = 0. It is classified as malicious if m_{i′} > 10 for all i′ ∈ N_10^i, i.e. if all the 10 nearest neighbors N_10^i of i are supposedly malicious. A neighboring sample i′ is considered supposedly malicious if m_{i′} > 10, i.e. if it has been flagged as malicious by more than 10 AV engines. Furthermore, a sample i is considered to be unpredictable if it does not fulfill the requirement to be classified as clean or malicious. In the production of the histogram depicted in Figure 5, samples that are unpredictable are omitted. In Figure 5, the concepts of false negative, false positive, true positive and true negative are illustrated.

Figure 5. Prediction accuracy of the modified K-NN classifier.

Introducing the unpredictable class considerably improves the prediction accuracy for the two other classes. This improvement is due to the fact that the uncertainty on the neighbors is used to separate the predictable and unpredictable samples. An unpredictable sample is a sample i such that its neighbors are neither all supposedly malicious (i.e. m_{i′} > 10) nor all supposedly clean (i.e. m_{i′} = 0).

Figure 6. The entries of the confusion matrix (false positive, false negative, true positive and true negative) plotted as a function of the parameter k, the number of nearest neighbors. In addition, the number of unpredictable samples is represented.
C. Influence of the Number of Nearest Neighbors in the
Modified K-NN Classifier on the Confusion Matrix
In Figure 5 are illustrated the notions of false positive, false negative, true positive and true negative. A prediction for a sample i is considered to be a false positive if m_{i′} > 10 for all i′ ∈ N_k^i and m_i = 0 hold at the same time. This means that all the k nearest neighbors N_k^i of sample i are supposedly malicious, while sample i itself is considered to be supposedly clean (m_i = 0). Similarly, a true positive means that m_{i′} > 10 for all i′ ∈ N_k^i and m_i > 10 are true for sample i. Furthermore, false negatives are characterized by m_{i′} = 0 for all i′ ∈ N_k^i and m_i > 10, while a true negative is a sample i for which m_{i′} = 0 for all i′ ∈ N_k^i and m_i = 0 hold.

The entries of the confusion matrix (false positive, false negative, true positive and true negative) are plotted in Figure 6 as a function of the parameter k, the number of nearest neighbors. Sample i is unpredictable if neither m_{i′} = 0 for all i′ ∈ N_k^i nor m_{i′} > 10 for all i′ ∈ N_k^i is true. The number of unpredictable samples increases monotonically with increasing k; this must be so, as increasing k by one introduces an additional condition that has to be fulfilled in order for a sample to be classified as predictable. In fact, if a sample is labeled as unpredictable for k, it cannot become predictable for k + l, l > 0.

In Figure 6, one can note that the number of false and true negatives stops decaying at k = 40. However, at k = 40 the number of true and false positives is still decaying at a rapid rate. The reason for this difference might be that there are far fewer supposedly clean samples than supposedly malicious ones. Also, the cluster size distribution might be different for these two categories, which could manifest itself in the different decay behaviors in Figure 6.

Figure 6 can be used to choose the parameter k that fits the needs of the user of the modified K-NN method. Furthermore, note the difference in the decay exponents for the true and false positive rates. If k is increased from 2 to 100, the number of true positives decreases from 17150 to 6204, while the number of false positives decreases from 531 to 17. The decrease in true positives is 64% while the decrease in false positives is 97%. So if one wants to increase the true positive/false positive ratio, it is advisable to increase the number of neighbors k. On the other hand, one should not forget that by increasing k one also increases the number of unpredictable samples. In order to limit this amount of unpredictable samples, the number of nearest neighbors k has been chosen as 11 for the final detection of the false negatives.
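The trade-off quoted above can be checked with a few lines of arithmetic, using the counts reported for k = 2 and k = 100:

```python
tp_2, tp_100 = 17150, 6204  # true positives at k = 2 and k = 100
fp_2, fp_100 = 531, 17      # false positives at k = 2 and k = 100

print(round(100 * (1 - tp_100 / tp_2)))  # 64  (% decrease in true positives)
print(round(100 * (1 - fp_100 / fp_2)))  # 97  (% decrease in false positives)

# The TP/FP ratio improves by roughly an order of magnitude:
print(round((tp_100 / fp_100) / (tp_2 / fp_2), 1))  # 11.3
```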
D. Influence of the Random Projection Dimension on
the Confusion Matrix
In the previous section, the dependency of the confusion matrix on the number of neighbors was discussed, with the dimension of the random projected vectors fixed at d = 300. In this section, the effects of varying d on the confusion matrix are investigated. In order to have a very small number of false negatives and to demonstrate the influence of d, the number of neighbors k is chosen to be 30 in this section. Figure 7 shows the dependency of the confusion matrix on the number of dimensions d of the projected vectors. Clearly, increasing d improves the results: the number of unpredictable samples decreases, while the true positives increase and the false positives decrease.

The true and false negatives do not change much with increasing d. This might be related to the fact that at k = 30 the decay of the true and false negatives in Figure 6 has almost completely stopped. So, even though the low value of d = 300 might mean that the distances in the d = 300-dimensional random projected Euclidean space are noisy compared to the D > 10^6-dimensional original space, the samples that are true and false negatives are insensitive to this noise.

Figure 7 indicates that convergence in all confusion matrix elements can be reached by using d = 700. By increasing d even further, no significant improvement is observed.

The use of the random projection method is almost unavoidable: if one would like to use the original space (with dimensionality D > 10^6), the complexity of the problem (in terms of memory and computational time) can become an issue, as D has been as high as 5·10^8 in other related experiments. In this situation, if one wishes to calculate distances between vectors in the original space, then all the data needs to be located in memory (since the original space is spanned by all the hashes produced by the sandbox). Furthermore, here a set of samples of cardinality of the order of 10^4 has been considered. However, future experiments will be on the scale of 10^6 samples, where using the original space might become prohibitive.

The total computational time needed to run the methodology on the 21053 samples is a few hours using a Python implementation of the random projections and K-NN. In comparison, without the random projection approach, the computational time is estimated at a few weeks, due to the dimensionality of B^i.

Finally, based on these results one might improve the previously presented random projection method by using a different number of dimensions d for each pairwise distance calculated. One could treat larger distances with less accuracy (lower d) while treating smaller distances with better accuracy (higher d). This is a possible direction for future research.
E. Manual Analysis by a Human Expert and Further
Work
Using d = 100 projection dimension and a modified KNN with k = 11, 24 false negatives have been extracted
unpredictable
will be investigated and combined with static features
(code signatures, packer information. . . ) extracted from
the samples before sandbox execution. This will be the
natural continuation of the presented work.
true negatives
Acknowledgments
1.3 x 10
1.1
0.9
335
Amount
325
315
3
2.5
false negatives
2
12000
10000
8000
100
80
60
40
0
true positives
false positives
100
200
300
400
500
600
d, dimension of the random projected vectors
700
Figure 7. Dependency of the elements of the confusion matrix with
respect to the number of dimensions d of the projected vectors.
out of the 2441 possibly clean files. This reduced
number allows the manual analysis by a human expert.
According to an F-Secure Corporation expert, 25% of
these 24 files are surely malicious. 50% have a relatively
high probability to be also malicious. The remaining 25%
are considered as clean by the expert.
Even with such a reduced number of candidates,
a human analysis is taking time and has high costs
(especially if the 50% of unsure samples have to be
further investigated). This shows the usefulness of the
presented methodology since it would be impossible to
find enough highly qualified experts to analyze the initial
2441 possibly clean files.
The same methodology will be applied in the future
using different labeling than the one provided by VirusTotal. Also, different dynamic features will be investigated and eventually combined with some static features
(code signatures, packer information. . . ), and possibly
other types of malware in the sample set.
V. Conclusion
In this paper, a robust two-stage methodology has
been introduced in order to both perform classification
of executable files and detect the files with the highest
probability of being false negatives (malware that is labeled as possibly clean). It has been shown that the methodology is not only accurate but also reduces the computational time by several orders of magnitude. This makes the proposed methodology a valid candidate as a pre-processing tool to provide inputs to forensic experts in order to detect malware that has not yet been detected by the AV engines used in VirusTotal.
Furthermore, this methodology can also be applied to other labelings. Also, new and different dynamic features will be investigated and combined with static features (code signatures, packer information, etc.) extracted from the samples before sandbox execution. This will be the natural continuation of the presented work.
Acknowledgments
The authors of this paper would like to acknowledge F-Secure Corporation for providing the data and software
required to perform this research. Special thanks go to
Pekka Orponen (Head of the ICS Department, Aalto
University), Alexey Kirichenko (Research Collaboration
Manager F-Secure) and Daavid Hentunen (Researcher
F-Secure) for their valuable support and many useful
comments. This work was supported by TEKES as part
of the Future Internet Programme of TIVIT. Part of the
work of Amaury Lendasse and Alexander Ilin is funded
by the Adaptive Informatics Research Centre, Centre of
Excellence of the Finnish Academy.
References
[1] Y. Liu, L. Zhang, J. Liang, S. Qu, and Z. Ni, "Detecting trojan horses based on system behavior using machine learning method," in Machine Learning and Cybernetics (ICMLC), 2010 International Conference on, vol. 2, July 2010, pp. 855–860.
[2] I. Firdausi, C. Lim, A. Erwin, and A. Nugroho, "Analysis of machine learning techniques used in behavior-based malware detection," in Advances in Computing, Control and Telecommunication Technologies (ACT), 2010 Second International Conference on, December 2010, pp. 201–203.
[3] E. Menahem, A. Shabtai, L. Rokach, and Y. Elovici, "Improving malware detection by applying multi-inducer ensemble," Computational Statistics & Data Analysis, vol. 53, no. 4, pp. 1483–1494, 2009.
[4] L. Sun, S. Versteeg, S. Boztaş, and T. Yann, "Pattern recognition techniques for the classification of malware packers," in Information Security and Privacy, ser. Lecture Notes in Computer Science, R. Steinfeld and P. Hawkes, Eds. Springer Berlin / Heidelberg, 2010, vol. 6168, pp. 370–390.
[5] J. Kinable and O. Kostakis, "Malware classification based on call graph clustering," Journal in Computer Virology, pp. 1–13, 2011.
[6] A. Srivastava and J. Giffin, "Automatic discovery of parasitic malware," in Recent Advances in Intrusion Detection (RAID '10), ser. Lecture Notes in Computer Science, S. Jha, R. Sommer, and C. Kreibich, Eds. Springer Berlin / Heidelberg, 2010, vol. 6307, pp. 97–117.
[7] C. Willems, T. Holz, and F. Freiling, "Toward automated dynamic malware analysis using CWSandbox," IEEE Security and Privacy, vol. 5, pp. 32–39, March 2007.
[8] K. Yoshioka, Y. Hosobuchi, T. Orii, and T. Matsumoto, "Vulnerability in public malware sandbox analysis systems," in Proceedings of the 2010 10th IEEE/IPSJ International Symposium on Applications and the Internet, ser. SAINT '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 265–268.
[9] M. Bailey, J. Andersen, Z. Morley Mao, and F. Jahanian, "Automated classification and analysis of internet malware," in Recent Advances in Intrusion Detection (RAID '07), 2007.
[10] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et des Jura," Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, pp. 547–579, 1901.
[11] P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison Wesley, 2005.
[12] Hispasec Sistemas, "VirusTotal analysis tool," 2011, http://www.virustotal.com.
[13] Norman ASA, "Norman launches Sandbox SDK," April 2009, http://www.norman.com/about_norman/press_center/news_archive/2009/67431/en.
[14] F-Secure Corporation, "F-Secure DeepGuard: a proactive response to the evolving threat scenario," November 2006, http://www.rp-net.com/online/filelink/340/20061106%20F-secure_deepguard_whitepaper.pdf.
[15] F-Secure Corporation, "F-Secure DeepGuard 2.0 white paper," September 2008, http://www.f-secure.com/system/fsgalleries/white-papers/f-secure_deepguard_2.0_whitepaper.pdf.
[16] F-Secure Corporation, "Information about System Control and DeepGuard," January 2011, http://www.f-secure.com/kb/2034.
[17] D. Aha and D. Kibler, "Instance-based learning algorithms," in Machine Learning, 1991, pp. 37–66.
[18] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is 'nearest neighbor' meaningful?" in Int. Conf. on Database Theory, 1999, pp. 217–235.
[19] C. Bishop, Neural Networks for Pattern Recognition, 1st ed. Oxford University Press, USA, Jan. 1996.
[20] P. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice Hall, 1982.
[21] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. New York: Chapman & Hall, 1993.
[22] B. Efron and R. Tibshirani, "Improvements on cross-validation: The .632+ bootstrap method," Journal of the American Statistical Association, vol. 92, no. 438, pp. 548–560, 1997.
[23] A. Lendasse, V. Wertz, and M. Verleysen, "Model selection with cross-validations and bootstraps: application to time series prediction with RBFN models," Lecture Notes in Computer Science, vol. 2714, pp. 573–580, 2003.
[24] Q. Yu, Y. Miche, A. Sorjamaa, A. Guillén, A. Lendasse, and E. Séverin, "OP-KNN: Method and applications," Advances in Artificial Neural Systems, vol. 2010, no. 597373, February 2010, 6 pages.
[25] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," in Conference in Modern Analysis and Probability, New Haven, USA, 1982, pp. 189–206.
[26] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," J. Comput. Syst. Sci., vol. 66, no. 4, pp. 671–687, 2003.
[27] S. Dasgupta, "Experiments with random projection," in Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, ser. UAI '00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 143–151.
[28] X. Fern and C. Brodley, "Random projection for high dimensional data clustering: A cluster ensemble approach," in International Conference on Machine Learning (ICML '03), 2003, pp. 186–193.
[29] D. Fradkin and D. Madigan, "Experiments with random projections for machine learning," in KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2003, pp. 517–522.
[30] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse, "OP-ELM: Optimally-pruned extreme learning machine," IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 158–162, January 2010.
[31] S. Vempala, The Random Projection Method, ser. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, 2005, vol. 65.
A Two-Stage Methodology using K-NN and False Positive Minimizing ELM for Nominal Data Classification
Yoan Miche1, Anton Akusok1, Jozsef Hegedus1, Rui Nian4 and Amaury Lendasse1,2,3
1 Department of Information and Computer Science, Aalto University, FI-00076 Aalto, Finland
2 IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
3 Computational Intelligence Group, Computer Science Faculty, University of the Basque Country, Paseo Manuel Lardizabal 1, Donostia/San Sebastián, Spain
4 College of Information and Engineering, Ocean University of China, Qingdao, 266003 China
Abstract
In this paper, a methodology for performing binary classification on nominal
data under specific constraints is proposed. The goal is to classify as many
samples as possible while avoiding False Positives at all costs, all within the
smallest possible computational time. Under such constraints, a fast way
of calculating pairwise distances between the nominal data available for all
samples, is proposed. A two-stage decision methodology using two types of
classifiers then provides a fast means of obtaining a classification decision on
a sample, keeping False Positives as low as possible while classifying as many
samples as possible (high coverage). The methodology has only two parameters, which respectively set the precision of the distance approximation and the final tradeoff between False Positive rate and coverage. Experimental results using a specific data set provided by F-Secure Corporation show that this methodology provides a rapid decision on new samples, with a direct control over the False Positives.
Email address: {yoan.miche,anton.akusok,jozsef.hegedus,amaury.lendasse}@aalto.fi, [email protected] (Yoan Miche, Anton Akusok, Jozsef Hegedus, Rui Nian and Amaury Lendasse)
Preprint submitted to Elsevier, July 4, 2012
1. Introduction
Classification problems relying solely on the distances between the different samples are common in genetics [1], or in syntactic and document resemblance problems [2, 3]. The reason for the direct use of the distance matrix in these setups is that the original data does not lie in a Euclidean space, but is usually nominal data, i.e. without any sense of ordering between two different values. As such, distance matrices usually need to be calculated using non-Euclidean metrics.
The interest in this paper is about the problem of binary classification for
such nominal data problems, under certain specific constraints: zero False
Positives, high coverage and small computational time.
While the high coverage constraint is rather typical (achieving the highest True Positive and True Negative rates possible), the zero False Positive constraint is not. In addition, the False Negatives are not regarded as very important in this problem setup: even if lowering False Negatives means increasing the coverage, the most highly regarded requirement is on the False Positives.
As mentioned, the fact that the data is nominal makes it mandatory to
use methods which directly deal with the distance matrix. A means of computing this distance matrix is first described, by the use of an approximation
technique based on Min-Wise independent hash function families.
The following Section 2 describes a very specific application of this proposed methodology to Malware detection for computer security. This application is exactly framed by the previously mentioned constraints. In addition, this application provides experimental data on which the proposed
methodology is tested in Section 4.
Section 3 describes first the matter of calculating distances between samples and then how the use of the Jaccard distance remains possible with
the low computational time imperative, by estimating it using Locality Sensitive Hashing. A 1-Nearest Neighbor classifier is then proposed as a first step and its shortcomings listed, while Section 4 details the complete two-step methodology which addresses these issues along with the experimental
results.
2. A Specific Application
The goal of Anomaly Detection in the context of computer intrusion detection [4] is to identify abnormal behavior (defined as deviating from what is considered normal behavior) and signal the anomaly in order to take appropriate measures: identification of the anomaly source, shutdown or closing of sensitive information or software, etc.
Most current anomaly detection systems rely on sets of heuristics or rules
to identify this abnormality. Such rules and heuristics enable some flexibility
on the detection of new anomalies, but still require action from the expert to
tune the rules according to the new situation and the potential new anomalies identified. One ideal goal is then to have a global system capable of
learning what constitutes normal and abnormal behavior and therefore be
able to identify reliably new anomalies [5, 6]. In such a context, the only
human interaction required is the monitoring of the system, to ensure that
the learning phase happened properly.
A small part of the whole anomaly detection problem is studied in this
paper, in the form of a binary classification problem for malware and clean
samples. While the output of this problem is quite typical, the input is not.
In order to compare files together and compute a similarity between them, a
set of features is needed. F-Secure Corporation devised such a set of features
[7], based partly on sandbox execution (virtual environment for a sample
execution [8, 9]). This sandbox is capable of providing a wide variety of
behavioral information (events), which as a whole can be divided into two
main categories: hardware-specific or OS-specific. The hardware-specific information is related to the low-level, mostly CPU-specific, events occurring
during execution of the application being analyzed in the virtual environment (up to the CPU instruction flow tracing). The other category mostly
relates to the events caused by interaction of the application with the virtual OS (the sandbox). This category includes information such as General
Thread/Process events (e.g. Start/Stop/Switching), API call events, specific
events like Structured Exception Handling, system module loading, etc. Besides, the sandbox can provide (upon user request) some other information about application execution, like reaching pre-set breakpoints or detecting behavioral patterns which are not typical for traditional well-written benign applications (e.g., so-called anti-emulation and anti-debugging tricks), etc.
The sandbox features used in the following research are thus the dynamic
Figure 1: Feature extraction from a file (sample): The sandbox runs the sample in a
virtual environment and extracts dynamic (run-time specific) information; meanwhile a
set of static features are extracted and both sets are combined in the whole feature set.
component of the collected features. Dynamic features in this context refer to
those gathered from the Sandbox while an inspected application was executed
in it. Some examples of those are what API calls were called and with
what parameters, various types of memory and code fingerprints. Static
features refer to some of the features gathered from the executable binary
itself without actually executing it. Some examples of those are what packer
it was compressed with and various code and data fingerprints. There are
15 features from the static domain and as many from the dynamic domain,
containing up to tens of thousands of values each. Each of these features
can be present or absent for one sample (e.g. if the sample studied does not
perform some classical operations in the sandbox, some features do not get
activated). As such, the input data obtained per sample usually consists of
tens of thousands of values for each feature number. The feature values are
represented by CRC64 hashes.
One of the major challenges is related to this data size: each sample having some tens of thousands (on average) of feature-value pairs (at most 30 features per sample, with thousands of values per feature), sample-to-sample comparisons are computationally non-trivial. Also, due to the nature of the data, measuring similarities between files
requires specific metrics that can be applied to nominal data (i.e. with no
sense of order between values, as opposed to ordinal data). Indeed, since the
actual feature values are encoded as hashes (and represent function strings, series of arguments, parameters, etc.), classical measures used in Euclidean spaces do not apply. The Jaccard similarity enables such comparisons and is
                     Prediction
                     Malware                  Clean
Actual
Malware              True Positive (TP)       False Negative (FN)
Clean                False Positive (FP)      True Negative (TN)
Table 1: Confusion Matrix for this binary classification problem.
detailed in Section 3, with the computational challenges it poses.
In addition to this specificity of the data, the requirements on the performance of the classifier are particular as well. As a security company, F-Secure Corporation needs to have very low false positives on any deployed anomaly detection system: if a clean file is labeled as malware (i.e. is a false positive), it is likely that several clients will see this same error deployed on their machines as well. This single mistake will potentially hinder seriously the work on all the affected machines, making the clients unhappy about the product and thus deactivating it or switching to a competing one. Therefore, while
typical binary classification problems addressed by machine learning focus
on optimizing the accuracy, one of the goals of the methodology presented
in this paper is to lower the false positives to achieve 0. To clarify notations,
Table 1 summarizes the confusion matrix used in this paper.
An additional practical constraint also makes this problem particular. Since the goal is the identification and classification of new malware samples, there is an imperative on the time it takes to reach a decision per sample: the faster an answer is provided, the quicker the information concerning a new sample can be deployed, possibly preventing infection at many other sites. As such, computational times need to be reduced as much as
possible.
3. Problem Description
This section first describes the problem in terms of the nature of the data at hand, and a way to calculate distances between files using this very data. The matter of the computational requirements for such calculations is addressed by an approximation based on Min-Wise independent families of hash functions. The parameters of this approximation are then determined and its effects investigated.
3.1. Data Specifics and Distance Calculation
3.1.1. Data Specifics
Distances in a traditional Euclidean sense are usually calculated for points whose coordinates locate them in the space. Having a data set consisting of multiple hashes, with different hashes representing incomparable properties or attributes, makes that data effectively categorical and does not allow distances to be calculated in a classical manner. The specifics and origin of the data set used in this paper are confidential, as the data is provided by F-Secure Corporation. Original values present in the data have been hashed using the CRC64 hash function, so as to obfuscate the original details.
The data set is composed of a large amount of files (samples), each having
the following structure:
- 30 possible feature numbers (each representing a different class of information recorded about the sample);
- for each of these feature numbers, a variable amount of hashes (from 0 to tens of thousands).
The reason for this structure is that some feature numbers stand for a wide range of possible information: if one such feature number stands for, e.g., the names of all the functions called in this sample, the number of values associated with it is bound to be large for some samples. It is important to note that the number of feature values per feature number can be very different from file to file.
With this data structure, it is impossible to use traditional Machine Learning techniques, as most of them rely on the data points' positions in the sample space (usually expected to be Euclidean). In this paper, distances between samples are calculated by using the Jaccard index [10, 11],
as presented in the next subsection.
3.1.2. Distance Calculation for Nominal Data
One of the most classical similarity statistics for nominal data is the
Jaccard index [10]. It enables the computation of the similarity between
two sets of nominal attributes as the ratio between the cardinalities of their
intersection and of their union. Denoting A and B two sets of nominal
attributes, the Jaccard index is defined as
$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \qquad (1) $$
This index intuitively gives a good sense of the overlap (similarity) between the two sets: the more common attributes (hashes in this case) they have, the more static and dynamic properties the corresponding files (each associated with one set) share, and thus the higher the chance that they are of the same class. In addition, considering the Jaccard distance $J_\delta(A, B) = 1 - J(A, B)$ yields an actual metric, which enables the direct use of Machine Learning techniques.
In the case of this paper, the files not only have one set of attributes, but multiple, identified by their feature number. As such, let us redefine $A = \{A_i\}_{i \in \mathcal{A}}$, where $A_i$ is the set of hashes associated to feature number $i$, and $\mathcal{A}$ is the set of all feature numbers available for file A. Therefore, the Jaccard index needs to take into account all such feature numbers. A straightforward modification of the Jaccard index for this case is to define it as

$$ J(A, B) = \frac{1}{|C|} \sum_{i \in C} \frac{|A_i \cap B_i|}{|A_i| + |B_i| - |A_i \cap B_i|}, \qquad (2) $$

where $A_i$ and $B_i$ are the sets of feature values for feature number $i$ for files A and B respectively, and $C = \mathcal{A} \cap \mathcal{B}$, with $\mathcal{A}$ (resp. $\mathcal{B}$) the set of the feature numbers for file A (resp. B).
This way, only feature numbers present in both files are accounted for. In
addition, expressing the index like this enables to avoid computing the cardinality of the union, which saves some computational time, as the cardinality
of the sets Ai and Bi are known.
The computational time required for the multiple calculations of the Jaccard distance remains a problem, due to the intersection cardinality calculation. This problem is addressed in the following subsection by approximating
the Jaccard distance.
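As an illustration, the modified Jaccard index of Eq. 2 can be sketched in Python as follows (a minimal sketch: the dictionary-of-sets representation and the function name are assumptions for illustration, not part of the original implementation):

```python
def modified_jaccard(file_a, file_b):
    """Modified Jaccard index (Eq. 2): average, over the feature numbers
    present in both files, of the per-feature-number Jaccard index.
    Each file is represented as a dict: feature number -> set of hash values."""
    common = file_a.keys() & file_b.keys()  # C: feature numbers present in both files
    if not common:
        return 0.0
    total = 0.0
    for i in common:
        inter = len(file_a[i] & file_b[i])               # |A_i ∩ B_i|
        union = len(file_a[i]) + len(file_b[i]) - inter  # |A_i| + |B_i| - |A_i ∩ B_i|
        total += inter / union if union else 0.0         # guard against empty hash sets
    return total / len(common)
```

Note that the union cardinality is derived from the known set sizes, as in Eq. 2, so the union itself is never built.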
3.2. Speeding up the distance calculations
The main drawback of the original Jaccard distance lies in the computational time required for its calculation. While the intersection of two sets (the upper part of the fraction in Eq. 2) is relatively fast (for example, the Python language implementation of it has an average complexity of $O(\min\{|A_i|, |B_i|\})$ and a worst case of $O(|A_i| \cdot |B_i|)$ [12]), the intersection of such large sets repeated multiple times makes the total computational time intractable. As mentioned before, the sets $A_i$ for one single feature number $i$ can total some tens of thousands of elements.
As such, the direct Jaccard distance calculations using Eq. 2 cannot be used. The specific requirement of near real-time computations for this problem raises the need for a fast approximation of the Jaccard distance.
3.2.1. Resemblance as an alternative to Jaccard index
Consider a file named A, and denote by |A| the number of hashes in this file (to avoid heavy notations, it is considered that only one feature number is present in the files; the following extends directly to the practical case of multiple feature numbers per file). Let us define by $S(A, l)$ the set of all contiguous subsequences of length $l$ of hashes of A. Using these notations, one can define [3] the resemblance $r_l(A, B)$ of two files A and B based on their hashes as

$$ r_l(A, B) = \frac{|S(A, l) \cap S(B, l)|}{|S(A, l) \cup S(B, l)|}, \qquad (3) $$

which is similar to the original definition of the Jaccard index. Defining the resemblance distance as

$$ d_l(A, B) = 1 - r_l(A, B) \qquad (4) $$

yields an actual metric [3, 2].
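For illustration, the sets $S(A, l)$ and the resemblance of Eq. 3 can be computed directly as follows (hash sequences are represented as plain Python lists; the function names are hypothetical):

```python
def subsequences(hashes, l):
    """S(A, l): the set of all contiguous length-l subsequences of the hashes of A."""
    return {tuple(hashes[i:i + l]) for i in range(len(hashes) - l + 1)}

def resemblance(hashes_a, hashes_b, l):
    """r_l(A, B), Eq. 3: the Jaccard index of the two subsequence sets."""
    s_a, s_b = subsequences(hashes_a, l), subsequences(hashes_b, l)
    return len(s_a & s_b) / len(s_a | s_b)
```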
Let us fix the size $l$ of the contiguous subsequences of hashes and denote by $\Omega_l$ the set of all such subsequences of length $l$. Let us assume that $\Omega_l$ is totally ordered and set a number of elements $n$. For any subset $\omega_l \subseteq \Omega_l$, denote by $\mathrm{MIN}_n(\omega_l)$ the set of the smallest $n$ elements (using the order on $\Omega_l$) of $\omega_l$, defined as

$$ \mathrm{MIN}_n(\omega_l) = \begin{cases} \text{the set of the smallest } n \text{ elements from } \omega_l, & \text{if } |\omega_l| \ge n \\ \omega_l, & \text{otherwise.} \end{cases} \qquad (5) $$
From [3], the following theorem gives an unbiased estimate of the resemblance $r_l(A, B)$.

Theorem 1. Let $\pi: \Omega_l \to \Omega_l$ be a permutation on $\Omega_l$ chosen uniformly at random and let $M(A) = \mathrm{MIN}_n(\pi(S(A, l)))$. Defining $M(B)$ similarly, the following is an unbiased estimate of $r_l(A, B)$:

$$ \hat{r}_l(A, B) = \frac{|\mathrm{MIN}_n(M(A) \cup M(B)) \cap M(A) \cap M(B)|}{|\mathrm{MIN}_n(M(A) \cup M(B))|}. $$
The proof can be found in [3].
As such, once a random permutation $\pi$ is chosen, it is possible to use only the set $M(A)$ (instead of the whole of A) for resemblance-based calculations.
3.2.2. Weak Universal Hashing and Min-Wise Independent Families
Note that while CRC64 cannot be considered as a random hash function, the notion of weak universality for a family of hash functions proposed in [13] makes it possible to further extend the former approximation to families of hash functions satisfying

$$ \Pr\big(h(s_1) = h(s_2)\big) \le \frac{1}{M}, \qquad (6) $$

with $h$ a hash function chosen uniformly at random from the family $H$ of functions $U \to \mathcal{M}$, $s_1$ and $s_2$ elements from the origin space $U$ of the hash functions in $H$, and $M = |\mathcal{M}|$. More precisely, in [14], the definition of min-wise independent family of functions is proposed in the spirit of the weak universality concept, and the authors show that for such families of functions, the resemblance can be computed directly.
Define as min-wise independent a family $H$ of functions such that for any set $X \subseteq \llbracket 1, N \rrbracket$ and any $x \in X$, when the function $h$ is chosen at random in $H$, we have

$$ \Pr\big(\min\{h(X)\} = h(x)\big) = \frac{1}{|X|}. \qquad (7) $$

That is, all elements of the set $X$ must have the same probability to become the minimum element of the image of $X$ under the function $h$. Assuming such a min-wise independent family $H$, then

$$ \Pr\big(\min\{h(S(A, l))\} = \min\{h(S(B, l))\}\big) = r_l(A, B), \qquad (8) $$

for files A and B and a function $h$ chosen uniformly at random from $H$; it is therefore possible to compute the resemblance $r_l(A, B)$ of files A and B by computing the cardinality of the intersection

$$ \{\min(h_1(S(A, l))), \ldots, \min(h_k(S(A, l)))\} \cap \{\min(h_1(S(B, l))), \ldots, \min(h_k(S(B, l)))\}, \qquad (9) $$

where $h_1, \ldots, h_k$ are a set of $k$ independent random functions from $H$. This way of calculating the resemblance of two files is sometimes called min-hash, and this name is used in the rest of this paper to denote this approach.
For computational and practical reasons, in this paper only one hash function is used (CRC64), and the cardinality of the intersection of Equation 9 is approximated as the cardinality of

$$ \{\min\nolimits_k(h(S(A, l)))\} \cap \{\min\nolimits_k(h(S(B, l)))\}, \qquad (10) $$

where the notation $\min_k(X)$ denotes the set of the $k$ smallest elements in $X$ (assuming $X$ is fully ordered). While this is a crude approximation, experiments show that the convergence with respect to $k$ towards the true value of the resemblance is assured, as shown in the following subsection.
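The single-hash-function min-hash of Eq. 10, combined with the estimator of Theorem 1, can be sketched as follows (hash values are taken as already-computed integers; CRC64 itself is not reproduced here, and the function names are hypothetical):

```python
def minhash_sketch(hash_values, k):
    """min_k(h(S(A, l))): keep the k smallest distinct hash values (Eq. 10)."""
    return sorted(set(hash_values))[:k]

def estimate_resemblance(sketch_a, sketch_b, k):
    """Estimator of Theorem 1, single-hash-function variant:
    |MIN_k(M(A) ∪ M(B)) ∩ M(A) ∩ M(B)| / |MIN_k(M(A) ∪ M(B))|."""
    union_min = set(sorted(set(sketch_a) | set(sketch_b))[:k])  # MIN_k of the union
    common = union_min & set(sketch_a) & set(sketch_b)
    return len(common) / len(union_min)
```

When $k$ is at least the size of the union of the two hash sets, the estimate coincides with the exact resemblance; for smaller $k$ it is an approximation whose error decreases with $k$, as Figure 2 illustrates.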
3.2.3. Influence of the number of hashes on the proposed min-hash approximation
Figure 2 illustrates experimentally the validity of the proposed approximation of the Jaccard distance by the min-hash based resemblance. These plots use a small subset of 3000 samples from the whole dataset, used only for the purpose of validating the amount of hashes $k$ required for a proper approximation.
As can be seen, with low amounts of hashes, such as $k = 10$ or $k = 100$ (subfigures (a) and (b)), quantization effects appear on the estimation of the resemblance, and the estimation errors are large. These quantization problems are especially important in regard to the method using these distances (K-Nearest Neighbors), as presented in the next section: since distances are so heavily quantized, samples at different distances appear to be at the same distance, and can thus wrongly be taken as nearest neighbors.
The quantization effects are lessened when $k$ reaches the hundreds of hashes, as in subfigure (c), while the errors on the estimation remain large. Using $k = 2000$ hashes reduces such errors to only the largest distances, which are of less importance for the following methodology. While $k = 10000$ hashes reduces these errors further (and even more so for larger values of $k$), the main reason for using the described min-hash approximation is to reduce drastically the computational time.
Figure 3 is a plot of the average time required per sample for the determination of the distances to the whole reference set, with respect to the
number of hashes k used for the min-hash. Thanks to the use of the Apache
Cassandra backend (with three nodes) for these calculations1 , the computational time only grows linearly with the number of hashes (and also linearly
with the number of samples in the reference set, although this is not depicted
here). Unfortunately, large values of $k$ do not decrease the computational time sufficiently for the practical application of this methodology. Therefore,
1
Details of the implementation are not given in this paper, but can be found from
the publications and deliverables of the Finnish ICT SHOK Programme Future Internet:
http://www.futureinternet.fi
(a) k = 10 hashes  (b) k = 100 hashes  (c) k = 500 hashes
(d) k = 1000 hashes  (e) k = 2000 hashes  (f) k = 10000 hashes
Figure 2: Influence of the number of hashes k over the min-hash approximation of the
resemblance r. The exact Jaccard distance is calculated using the whole amount of the
available hashes for each sample.
Figure 3: Average time per sample (over 3000 samples) versus the number k of hashes
used for the min-hash approximation.
in the following, $k = 2000$ hashes is used for the min-hash approximation of the Jaccard distance, as a good compromise between computational time and approximation error.
4. Methodology using two-stage classifiers
This section details the use of a two-stage decision strategy so as to avoid False Positives while retaining high coverage. The first-stage decision uses a 1-NN, which still yields too high a False Positive rate; this rate is lowered by using an optimized Extreme Learning Machine model, specialized either for False Positive or False Negative minimization.
4.1. First Stage Decision using 1-NN
4.1.1. Using K-NN with min-hash Distances
The K-Nearest Neighbor [15] method for classification is one of the most
natural to use in this setup, since it relies directly and only on distances. As
mentioned in the previous subsection, for this classifier to perform well, it
requires the proper identification of the real nearest neighbors: the approximation made using the min-hash cannot be too crude.
Using $k = 2000$ hashes, a reference set is devised by F-Secure Corporation, which contains samples that are considered to be representative of most current malware and clean samples. This set contains about 10000 samples (for each of which the $k = 2000$ minimum hashes have been extracted
Figure 4: 1-NN-ELM: Two-stage methodology using first a 1-NN and then specialized ELM models to lower false positives and false negatives. The first stage uses only the class information $C_{1NN}$ of the nearest neighbor, while the second stage uses additional neighbors' information: the distance $d_{1NN}$ to the nearest neighbor, the distance $d_{NN}$ to the nearest neighbor of the opposite class, and the rank $R_{NN}$ (i.e. which neighbor it is) of this opposite-class neighbor.
per feature number), balanced equally between clean and malware samples.
The determination of this reference set is especially important as it should
not contain samples for which there are some uncertainties about the class:
Only samples with the highest probability of being either malware or clean
are present in the reference set.
Once this reference set is fixed, samples can be compared against it using
the min-hash based distances and a K-NN classifier.
Determining K for this problem is done using a validation set for which
the certainty of the class of each sample is very high as well. The validation
set contains 3000 samples, checked against the reference set of 10000 samples.
Figure 5 depicts the classification accuracy (average of True Positive and True
Negative rates) versus the value of K used for the K-NN. Surprisingly, the
decision based on the very first nearest neighbor is always the best in terms
of classification accuracy. Therefore, in the following methodology presented
in Section 4, a 1-NN is used as the first step classifier.
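Given a precomputed row of min-hash distances from a test sample to the reference set, the K-NN decision just described can be sketched as follows (a minimal pure-Python sketch; the names are hypothetical):

```python
from collections import Counter

def knn_predict(dist_row, ref_labels, K=1):
    """Majority vote among the K reference samples nearest to the test sample.

    dist_row[j] is the (approximate Jaccard) distance from the test sample
    to reference sample j; ref_labels[j] is that sample's class."""
    nearest = sorted(range(len(dist_row)), key=dist_row.__getitem__)[:K]
    return Counter(ref_labels[j] for j in nearest).most_common(1)[0][0]
```

With K = 1 this reduces to the first-stage 1-NN decision retained in the following.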
4.1.2. 1-NN is not sufficient
As mentioned earlier, one of the main imperatives in this paper is to
achieve 0 False Positives (in absolute numbers). As Table 2 depicts, by using
a test set (totally separate from the validation sets used above) composed of
28510 samples for which the class is known with the highest confidence, with
13
0.95
Classification Accuracy
0.945
0.94
0.935
0.93
0.925
0.92
0.915
0.91
7
9
11
13
Number of Nearest Neighbors used (K)
15
17
Figure 5: K 1 is the best for this specific data regarding classification accuracy.
Prediction
Malware
Clean
Actual
Malware Clean
18160
183
277
9890
Table 2: Confusion Matrix for the sole 1-NN on the test set. If only the first stage of the
methodology is used, results are unacceptable in terms of False Positive rates.
the 1-NN approach still yields large amounts of False Positives. Note that
this test set is unbalanced, although not significantly.
The results of the 1-NN are not satisfactory regarding the constraint on the False Positives. An obvious way of directly addressing the amount of False Positives is to set a maximum threshold on the distance to the first nearest neighbor: above this threshold, the sample is deemed too far from its nearest neighbor, and no decision is taken.
While this strategy would effectively reduce the number of False Positives, it also significantly lowers the number of True Positives, i.e. the coverage. For this reason, and to keep a high coverage, the following methodology using a second-stage classifier such as the ELM is proposed.
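The rejection-threshold alternative mentioned above amounts to a one-line rule; the function name and the `None`-for-no-decision convention here are illustrative assumptions:

```python
def thresholded_1nn(nn_class, d_1nn, max_dist):
    """1-NN decision with a rejection threshold: above max_dist the sample is
    deemed too far from its nearest neighbor and no decision is taken (None)."""
    return None if d_1nn > max_dist else nn_class
```

Lowering `max_dist` trades coverage for fewer False Positives, which is exactly the drawback discussed above.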
As can be seen from Figure 3, the computational time required to calculate the distances from a test sample to the whole set of 10000 reference samples is about 35 seconds on average, using k = 2000 hashes. This is still acceptable from the practical point of view, but adding a second-stage classifier has the obvious drawback of increasing this time.

Figure 6: Illustration of different situations with identical 1-NN: in (a) the density of reference samples of the same class around the test sample gives the decision high confidence; in (b), while the 1-NN is of the same class as in (a), the confidence in the decision should be very different.
In order to make this increase as small as possible, an Extreme Learning Machine model specialized for False Positives (and another for False Negatives) is used. Figure 4 illustrates the global idea of this two-stage methodology.
The motivation for an additional classifier comes from the fact that the single piece of information from the 1-NN is not sufficient: the distance to that first neighbor is important as well, and so are the distance and the rank of the nearest neighbor of the opposite class. Figure 6 attempts to illustrate two different situations in which a test sample has its first nearest neighbor in the same class (note that the position of the samples has no meaning here, due to the nominal nature of the data; the distances are what matters). In the first case (a), the confidence in the decision must be high, as many of the neighbors of the test sample are near and of the same class. Case (b) is very different and needs to have a much lower confidence in the decision taken, if any.
A means of describing such situations is to account for:
1. The distance to the nearest neighbor d1NN: if the nearest neighbor is far, it is likely that the test sample is in a part of the original space where the density of reference samples is insufficient;
2. The distance to the nearest neighbor of the opposite class d≠NN: if d1NN is very similar to d≠NN, the test sample lies in a part of the space where reference samples of both classes are present and at similar distances;
3. The rank of this neighbor of the opposite class R≠NN (is it the 3rd or the 100th neighbor?): this information gives a rough sense of the density, around the test sample, of the reference samples of the same class as that of the nearest neighbor.
The combination of these three additional pieces of information roughly describes the situation in which the test sample lies. This is the information fed to the second-stage classifier for the final decision.
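Assembling these three quantities from an already distance-sorted neighbor list can be sketched as below; the function name and the sentinel values returned when no opposite-class neighbor is found are illustrative assumptions:

```python
def second_stage_features(neighbor_labels, neighbor_dists, first_class):
    """Given the reference neighbors sorted by increasing distance, return the
    triple (d_1NN, d_opposite, rank_opposite) fed to the second-stage ELM.

    neighbor_labels : class of each neighbor, nearest first
    neighbor_dists  : distance to each neighbor, nearest first
    first_class     : class of the nearest neighbor (the 1-NN decision)
    """
    d_1nn = neighbor_dists[0]
    for rank, (label, dist) in enumerate(zip(neighbor_labels, neighbor_dists), start=1):
        if label != first_class:
            return d_1nn, dist, rank  # first neighbor of the opposite class
    # no opposite-class neighbor in the list: sentinel values
    return d_1nn, float("inf"), len(neighbor_labels) + 1
```

These features are available for free, since the full sorted distance list is already computed while searching for the nearest neighbor.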
4.2. Second Stage Decision using modified ELM
4.2.1. Original ELM
The Extreme Learning Machine (ELM) algorithm was originally proposed by Guang-Bin Huang et al. in [16, 17, 18, 19] and uses the Single-Layer Feedforward Neural Network (SLFN) structure. The main concept behind the ELM lies in the random initialization of the SLFN weights and biases. Then, under certain conditions, the synaptic input weights and biases do not need to be adjusted (classically done through iterative updates such as backpropagation), and it is possible to calculate the hidden layer output matrix and hence the output weights directly. The complete network structure (weights and biases) is thus obtained in very few steps and at very low computational cost (compared to iterative methods for determining the weights).
Consider a set of M distinct samples (x_i, y_i) with x_i ∈ R^{d_1} and y_i ∈ R^{d_2}; then, a SLFN with N hidden neurons is modeled as the following sum

\[
\sum_{i=1}^{N} \beta_i \varphi(w_i x_j + b_i), \quad j \in \{1, \dots, M\},
\tag{11}
\]

with φ being the activation function, w_i the input weights, b_i the biases and β_i the output weights.
In the case where the SLFN perfectly approximates the data, the errors between the estimated outputs ŷ_i and the actual outputs y_i are zero, and the relation between inputs, weights and outputs is then

\[
\sum_{i=1}^{N} \beta_i \varphi(w_i x_j + b_i) = y_j, \quad j \in \{1, \dots, M\},
\tag{12}
\]

which writes compactly as Hβ = Y, with
\[
H =
\begin{pmatrix}
\varphi(w_1 x_1 + b_1) & \cdots & \varphi(w_N x_1 + b_N) \\
\vdots & \ddots & \vdots \\
\varphi(w_1 x_M + b_1) & \cdots & \varphi(w_N x_M + b_N)
\end{pmatrix},
\tag{13}
\]

β = (β_1^T ... β_N^T)^T and Y = (y_1^T ... y_M^T)^T.
Solving for the output weights β from the hidden layer output matrix H and the target values Y is achieved through the use of a Moore-Penrose generalized inverse of the matrix H, denoted H† [20].
Theoretical proofs and a more thorough presentation of the ELM algorithm are detailed in the original paper [16]. In Huang et al.'s later work, the ELM has been proved able to perform universal function approximation [19].
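A minimal sketch of this training procedure (fixed random hidden layer, output weights solved in one step via the Moore-Penrose pseudo-inverse), assuming a tanh activation and toy regression data; the function names are illustrative:

```python
import numpy as np

def elm_fit(X, Y, n_hidden, seed=0):
    """Train a basic ELM: input weights and biases are random and stay fixed,
    output weights are obtained in one step with the pseudo-inverse of H."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # input weights w_i
    b = rng.standard_normal(n_hidden)                # biases b_i
    H = np.tanh(X @ W + b)                           # hidden layer output matrix (Eq. 13)
    beta = np.linalg.pinv(H) @ Y                     # output weights: beta = H^+ Y
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Forward pass of the trained SLFN."""
    return np.tanh(X @ W + b) @ beta
```

Note that the only "training" is the single pseudo-inverse solve, which is what gives the ELM its very low computational cost.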
4.2.2. False Positive/Negative Optimized ELM
As depicted in Figure 6 and mentioned above, the class of the nearest neighbor alone is not sufficient information to obtain 0 False Positives. The proposed second-stage classifier uses modified ELM models to lower the amount of False Positives (one of the two modified ELM models reduces False Negatives as well; only the False Positive minimizing one is discussed in the following).
The modified ELM model used in the second stage of the methodology is specially optimized so as to minimize the False Positives (a similar model to minimize the False Negatives is used as well, in the same fashion). It uses additional information gathered while searching for the nearest neighbor (so no additional computational time is required to obtain the training data): the distance to the nearest neighbor d1NN, the distance to the nearest neighbor of the opposite class d≠NN, and the rank of this neighbor of the opposite class R≠NN. With this input data, the False Positive Optimized ELM is trained using a weighted classification accuracy criterion.
While for binary classification problems the classification rate Acc, defined as the average of the True Positive Rate TPR and the True Negative Rate TNR,

\[
\mathrm{Acc} = \frac{\mathrm{TNR} + \mathrm{TPR}}{2},
\tag{14}
\]

is typically used as a performance measure, the proposed modified ELM uses the following weighted accuracy Acc(α):

\[
\mathrm{Acc}(\alpha) = \frac{\alpha\,\mathrm{TNR} + \mathrm{TPR}}{1 + \alpha}.
\tag{15}
\]

By changing the weight α, it becomes possible to give precedence to the True Negative Rate and thus to avoid False Positives. The output of the proposed False Positive Optimized ELM is calculated using the Leave-One-Out (LOO) PRESS (PREdiction Sum of Squares) statistic, which provides a direct and exact formula for the calculation of the LOO error ε^PRESS for linear models. See [21] and [22] for details of this formula and its implementations:

\[
\varepsilon_i^{\mathrm{PRESS}} = \frac{y_i - h_i \beta}{1 - h_i P h_i^T},
\tag{16}
\]

where P is defined as P = (H^T H)^{-1}, H is the hidden layer output matrix of the ELM and β are the output weights of the ELM.
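Equation (16) can be evaluated for all samples at once from the diagonal of the hat matrix; a sketch, assuming H has full column rank so that H^T H is invertible:

```python
import numpy as np

def press_residuals(H, y):
    """Exact leave-one-out residuals of the linear model H @ beta = y (Eq. 16)."""
    P = np.linalg.inv(H.T @ H)               # P = (H^T H)^{-1}
    beta = P @ H.T @ y                       # least-squares output weights
    hat = np.einsum("ij,jk,ik->i", H, P, H)  # h_i P h_i^T for every sample i
    return (y - H @ beta) / (1.0 - hat)
```

This closed form is what makes the LOO error cheap: no model is ever actually refitted M times.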
In order to obtain a parsimonious model in the shortest possible time, the proposed modified ELM uses the idea of the TROP-ELM [23] and OP-ELM [24, 25, 26, 27, 28] to prune out neurons from an initially large ELM model [29]. In addition, for computational time considerations, the maximum number M of selected neurons desired for the final model is taken as a parameter. Overall, the False Positive Optimized ELM used in this paper follows the steps of Algorithm 1.
Algorithm 1 False Positive Optimized ELM.
Given a training set (x_i, y_i), x_i ∈ R^3, y_i ∈ {−1, 1}, an activation function φ: R → R, a large number of hidden nodes N and the maximum number M ≤ N of neurons to retain for the final model:
- Randomly assign input weights w_i and biases b_i, i ∈ {1, ..., N};
- Calculate the hidden layer output matrix H as in Equation 13;
for i = 1 to M do
- Perform Forward Selection of the i best neurons (among N) using the PRESS LOO output with the Acc(α) criterion, and ELM determination of the output weights β_i;
end for
- Retain the best combination out of the M different selections as the final model structure.
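The neuron-pruning loop of Algorithm 1 can be sketched as a greedy forward selection over the columns of the hidden layer output matrix. For brevity, this sketch ranks candidates by plain LOO mean squared error instead of the weighted accuracy criterion of Equation (15); the structure of the loop is the point being illustrated, and the function names are assumptions:

```python
import numpy as np

def loo_mse(H, y):
    """Exact leave-one-out mean squared error of the linear model H @ beta = y."""
    P = np.linalg.pinv(H.T @ H)              # (H^T H)^{-1}, pinv for robustness
    beta = P @ H.T @ y
    hat = np.einsum("ij,jk,ik->i", H, P, H)  # hat-matrix diagonal
    residuals = (y - H @ beta) / (1.0 - hat)
    return float(np.mean(residuals ** 2))

def forward_select(H_full, y, m_max):
    """Greedy forward selection of hidden neurons (columns of H_full): at each
    step add the column giving the lowest LOO error, then keep the best subset
    seen over the m_max steps (as in Algorithm 1)."""
    selected, remaining = [], list(range(H_full.shape[1]))
    best_subset, best_err = [], float("inf")
    for _ in range(m_max):
        errs = {j: loo_mse(H_full[:, selected + [j]], y) for j in remaining}
        j_star = min(errs, key=errs.get)
        selected.append(j_star)
        remaining.remove(j_star)
        if errs[j_star] < best_err:
            best_err, best_subset = errs[j_star], list(selected)
    return best_subset, best_err
```

Swapping `loo_mse` for a criterion based on Equation (15) yields the False Positive Optimized variant.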
The selection of the optimal α is done experimentally, following the two constraints of 0 False Positives and the highest possible coverage (i.e. as many True Positives as possible). Figure 7 is the Receiver Operating Characteristic curve for various values of α, plotted for a balanced validation set of 3000 samples. As can be seen, the requirement of absolutely 0 False Positives has a strong influence on the coverage (represented by the True Positive rate here). If one allows as little as 0.06% False Positives, the coverage already reaches 92%.

Figure 7: ROC curve (True Positive Rate versus False Positive Rate) for varying values of α.
Figure 8 depicts the plot of the False Positive rate against the value of α. This plot uses the same validation data as Figure 7. The value of α for which the 0 False Positives requirement is met while keeping the highest possible coverage is α = 30, from Figure 8.
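The weighted accuracy criterion of Equation (15), which this choice of weight optimizes, can be written directly from confusion-matrix counts; with the weight equal to 1 it reduces to the plain average of Equation (14). The function name is illustrative:

```python
def weighted_accuracy(tp, fp, tn, fn, alpha):
    """Weighted accuracy of Eq. (15): (alpha * TNR + TPR) / (1 + alpha).
    Larger alpha favors the True Negative Rate, i.e. penalizes False Positives."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (alpha * tnr + tpr) / (1.0 + alpha)
```

Sweeping the weight and keeping the smallest value that reaches 0 False Positives on validation reproduces the selection procedure sketched above.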
Figure 8: Evolution of the False Positive Rate as a function of the weight α. The first attained 0 False Positive Rate is for α = 30.

4.3. Final Results on Test Data
With the parameters of the two-stage methodology determined as above, i.e.:
- k = 2000 hashes used for the min-hash approximation of the Jaccard distance;
- K = 1 for the K-NN first-stage classifier;
- α = 30 for the False Positive Optimized ELM second-stage classifier,
the presented methodology is applied to a test set of 28510 samples spanning from early 2008 until late 2011. The reference set of 10000 samples mentioned before is within the same time frame and balanced between malware and clean so as to reflect the real proportions, i.e. those of the samples received by F-Secure Corporation: roughly 2/3 malware and 1/3 clean.
Table 3 gives the previous results of the sole 1-NN, to be compared against those of the 1-NN and False Positive Optimized ELM methodology. It can be seen that the False Positive rate achieved in test is in line with the Leave-One-Out results in (a).
The results depicted in Table 3 (c) use not only a False Positive Optimized ELM but also a False Negative Optimized ELM to reduce the False Negatives, as mentioned in Figure 4. The improvements in the reduction of the False Positives and the coverage achieved are satisfying for this test set.
A value of 2 False Positives on this test set is probably acceptable in practice. If the strict goal of 0 False Positives in test is to be enforced, then one possibility is to increase the parameter α to a higher, more conservative value. This has the effect of further lowering the coverage, though.
                      Actual
    Prediction   Malware   Clean
      Malware       1930       1
      Clean            1     908
      Unknown       2473    1623
(a) Confusion Matrix for the two-stage classifier methodology on the training data (Leave-One-Out results).

                      Actual
    Prediction   Malware   Clean
      Malware      18160     183
      Clean          277    9890
(b) Confusion Matrix for the sole 1-NN on the test set.

                      Actual
    Prediction   Malware   Clean
      Malware       8393       2
      Clean            7    4115
      Unknown      10037    5956
(c) Confusion Matrix for the two-stage classifier methodology on the test set.

Table 3: Confusion matrices for (a) the training data (Leave-One-Out results) when training the False Positive/Negative Optimized ELMs; on the whole test set, (b) using only the 1-NN approach and (c) using the proposed 1-NN and ELM two-stage methodology. The reduction in coverage from the second-stage ELM is noticeable, as False Positives and Negatives are decreased significantly.

Note on hardware and computational time considerations. While the details of the implementation are not mentioned in this paper, the proposed methodology uses a set of three computers, each equipped with 8GB of RAM and Intel Core2 Quad CPUs. Apache Cassandra is the distributed database framework used for performing efficient min-hash computations in batches, and a memory-held queueing system (based on memcached) holds jobs for execution against the Cassandra database. All additional computations are performed using Python code on one of the three computers mentioned.
With this setup, as seen in Figure 3, the average per-sample evaluation time (i.e. calculating pairwise distances to the 10000 reference samples and finding the closest elements) is about 35 seconds. The choice of Cassandra as a database backend is meant to ensure that the computational time grows only linearly if the precision of the min-hash or the number of reference samples is increased: growing the number of reference samples or the number k of hashes used for the min-hash approximation only requires a linear growth in the number of Cassandra nodes for the computational time to remain identical.
5. Conclusions
This paper proposes a practical, case-oriented methodology for a binary classification problem in the domain of Anomaly Detection. The practical problem at hand lies in the classification of files (samples) as either malware or clean, based on specific sets of nominal attributes, thus requiring purely distance-based Machine Learning techniques. The practical requirements for this binary classification problem are somewhat unusual, as no False Positives can be tolerated, while as many files as possible should be classified in the minimum computational time. False Negatives are not as important in this context.
In order to perform file-to-file comparisons, a distance measure known as the Jaccard distance is adapted to this problem setup, and a fast approximation of it, the Min-Hash approximation, is proposed. The Min-Hash approach makes it possible to estimate the Jaccard distance using a restricted subset of the whole sets of attributes of each file, thus lowering the computational time significantly. This approximation is shown experimentally to converge to the true Jaccard distance, given enough hashes.
A two-stage decision process using two different types of classifiers provides a fast decision while keeping the False Positive rate low: a 1-NN model using the estimated Jaccard distance provides an initial decision on the test sample at hand. Following in the second stage is a False Positive Optimized ELM (a False Negative Optimized ELM is used as well, to reduce False Negatives), which reduces the False Positives drastically, from 183 to 2 in test, at the cost of a lower coverage. Another advantage of the ELM-based second classifier is its very low computational time, so this second-stage decision comes at almost no additional cost.
Overall, the methodology proves to be efficient for this specific problem and has the advantage of having only two parameters that require tuning: the number of hashes used for the Min-Hash approximation (the more hashes used, the closer the approximation is to the real Jaccard distance), and the coefficient α weighting the False Positives in the modified ELM criterion (the value of this coefficient directly controls the trade-off between False Positive rate and coverage).
The parameters devised experimentally for the specific reference set make it possible to reach only 2 False Positives in test, with a coverage of 44% on the malware files. This methodology is currently being tested at F-Secure Corporation on different data sets (reference and test) for further validation.
References
[1] S. Lele, J. T. Richtsmeier, Euclidean distance matrix analysis: a coordinate-free approach for comparing biological shapes using landmark data, American Journal of Physical Anthropology 86 (3) (1991) 415-427.
[2] A. Z. Broder, S. C. Glassman, M. S. Manasse, G. Zweig, Syntactic clustering of the Web, Computer Networks and ISDN Systems 29 (8-13) (1997) 1157-1166.
[3] A. Z. Broder, On the resemblance and containment of documents, in: Compression and Complexity of SEQUENCES 1997, IEEE Computer Society, 1997, pp. 21-29.
[4] Y. Robiah, S. S. Rahayu, M. M. Zaki, S. Shahrin, M. A. Faizal, R. Marliza, A new generic taxonomy on hybrid malware detection technique, arXiv.org cs.CR.
[5] A. Srivastava, J. Giffin, Automatic discovery of parasitic malware, in: S. Jha, R. Sommer, C. Kreibich (Eds.), Recent Advances in Intrusion Detection (RAID'10), Springer Berlin / Heidelberg, 2010, pp. 97-117.
[6] M. Bailey, J. Andersen, Z. Morley Mao, F. Jahanian, Automated classification and analysis of internet malware, in: Recent Advances in Intrusion Detection (RAID'07), 2007.
[7] F-Secure Corporation, F-Secure DeepGuard: a proactive response to the evolving threat scenario (Nov. 2006).
[8] C. Willems, T. Holz, F. Freiling, Toward automated dynamic malware analysis using CWSandbox, IEEE Security and Privacy 5 (2007) 32-39.
[9] K. Yoshioka, Y. Hosobuchi, T. Orii, T. Matsumoto, Vulnerability in public malware sandbox analysis systems, in: Proceedings of the 2010 10th IEEE/IPSJ International Symposium on Applications and the Internet, IEEE Computer Society, Washington, DC, USA, 2010, pp. 265-268.
[10] P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et du Jura, Bulletin de la Société Vaudoise des Sciences Naturelles 37 (1901) 547-579.
[11] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, 1st Edition, Addison Wesley, 2005.
[12] Python, Python algorithms complexity, http://wiki.python.org/moin/TimeComplexity#set (December 2010).
[13] J. L. Carter, M. N. Wegman, Universal classes of hash functions, Journal of Computer and System Sciences 18 (2) (1979) 143-154.
[14] A. Z. Broder, M. Charikar, A. M. Frieze, M. Mitzenmacher, Min-wise independent permutations, Journal of Computer and System Sciences 60 (1998) 327-336.
[15] T. M. Cover, P. E. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13 (1) (1967) 21-27.
[16] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489-501.
[17] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42 (2) (2012) 513-529.
[18] G.-B. Huang, Q.-Y. Zhu, K. Z. Mao, C.-K. Siew, P. Saratchandran, N. Sundararajan, Can threshold networks be trained directly?, IEEE Transactions on Circuits and Systems II: Express Briefs 53 (3) (2006) 187-191.
[19] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Transactions on Neural Networks 17 (4) (2006) 879-892.
[20] C. R. Rao, S. K. Mitra, Generalized Inverse of Matrices and Its Applications, John Wiley & Sons, 1971.
[21] R. Myers, Classical and Modern Regression with Applications, 2nd Edition, Duxbury, Pacific Grove, CA, USA, 1990.
[22] G. Bontempi, M. Birattari, H. Bersini, Recursive lazy learning for modeling and control, in: European Conference on Machine Learning, 1998, pp. 292-303.
[23] Y. Miche, M. van Heeswijk, P. Bas, O. Simula, A. Lendasse, TROP-ELM: a double-regularized ELM using LARS and Tikhonov regularization, Neurocomputing 74 (16) (2011) 2413-2421. doi:10.1016/j.neucom.2010.12.042.
[24] E. Group, The OP-ELM toolbox, available online at http://www.cis.hut.fi/projects/eiml/research/downloads/op-elm-toolbox (2009).
[25] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM: optimally-pruned extreme learning machine, IEEE Transactions on Neural Networks 21 (1) (2010) 158-162. doi:10.1109/TNN.2009.2036259.
[26] Y. Miche, P. Bas, C. Jutten, O. Simula, A. Lendasse, A methodology for building regression models using extreme learning machine: OP-ELM, in: M. Verleysen (Ed.), ESANN 2008, European Symposium on Artificial Neural Networks, Bruges, Belgium, d-side publ. (Evere, Belgium), 2008, pp. 247-252.
[27] M. van Heeswijk, Y. Miche, E. Oja, A. Lendasse, GPU-accelerated and parallelized ELM ensembles for large-scale regression, Neurocomputing 74 (16) (2011) 2430-2437. doi:10.1016/j.neucom.2010.11.034.
[28] M. van Heeswijk, Y. Miche, E. Oja, A. Lendasse, Solving large regression problems using an ensemble of GPU-accelerated ELMs, in: M. Verleysen (Ed.), ESANN 2010: 18th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, d-side Publications, Bruges, Belgium, 2010, pp. 309-314.
[29] Y. Lan, Y. C. Soh, G.-B. Huang, Constructive hidden nodes selection of extreme learning machine for regression, Neurocomputing 73 (16-18) (2010) 3191-3199.