
IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 4, NO. 4, OCTOBER 2007

Hyperspectral Image Classification Using Relevance Vector Machines

Begüm Demir, Student Member, IEEE, and Sarp Ertürk, Member, IEEE

Abstract: This letter presents a hyperspectral image classification method based on relevance vector machines (RVMs). Support vector machine (SVM)-based approaches have recently been proposed for hyperspectral image classification and have attracted considerable interest. In this letter, an RVM-based approach is proposed for the classification of hyperspectral images. It is shown that approximately the same classification accuracy is obtained using RVM-based classification, with a significantly smaller relevance vector rate and, therefore, much faster testing time, compared with SVM-based classification. This feature makes the RVM-based hyperspectral classification approach more suitable for applications that require low complexity and, possibly, real-time classification.

Index Terms: Classification, hyperspectral images, relevance vector machines (RVMs), support vector machines (SVMs).

I. INTRODUCTION
SUPPORT vector machine (SVM)-based approaches [1]-[5] have been recently proposed for regression and classification tasks in multispectral [6], [7] and hyperspectral [8]-[12] images. For example, in [8], SVM classifiers have been applied to hyperspectral Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data. The effectiveness of SVM-based hyperspectral image classification has been addressed in [9]. An SVM classification system that estimates the SVM parameters in a fully automatic way is presented in [10]. Different kernel-based approaches and their properties have been analyzed for hyperspectral image classification in [11]. In [12], a smoothing preprocessing step is introduced before SVM classification to take spatial context into consideration and improve the classification rate. In [13] and [14], it has been proposed to use SVM-based regression for density estimation of class-conditional probabilities for maximum a posteriori classification.
Relevance vector machine (RVM)-based regression and classification have been proposed in [15]-[17]. The advantages of the RVM over the SVM are probabilistic predictions, automatic estimation of parameters, and the possibility of choosing arbitrary kernel functions [15], [16]. Most importantly, RVM classification results in fewer relevance vectors (RVs) compared with the number of support vectors (SVs) obtained in SVM classification. Hence, classification can be carried out much faster with the RVM than with the SVM. For example, in [17], the RVM has been used for the detection of microcalcification clusters in digital mammograms, and it has been shown that the RVM classifier is much more suitable for real-time processing and reduces the computational complexity compared to SVM-based classification, while maintaining similar detection accuracy.

It is proposed in this letter to utilize the RVM for the classification of hyperspectral images. It is shown that the RVM-based classification approach can provide similar classification accuracy (AC) to the SVM-based classification, with a significantly reduced number of RVs. This feature makes the RVM-based hyperspectral classification approach more suitable for applications that require low complexity and, possibly, real-time classification.
Manuscript received January 11, 2007; revised April 2, 2007. This work
was supported by the Scientific and Technological Research Council of Turkey
(TUBITAK) under the Hyperspectral Classification, Segmentation and Recognition project.
The authors are with the Kocaeli University Laboratory of Image and Signal
Processing (KULIS), University of Kocaeli, 41040 Kocaeli, Turkey.
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/LGRS.2007.903069

II. RVM CLASSIFICATION


Supervised learning techniques make use of a training set that consists of a set of sample input vectors $\{x_n\}_{n=1}^{N}$ together with the corresponding targets $\{t_n\}_{n=1}^{N}$. The targets are basically real values in regression tasks or class labels in classification problems. It is typically desired to learn a model of the dependency of the targets on the inputs from the training set, so that accurate predictions of $t$ can be made for previously unseen values of $x$. Commonly, these predictions can be based on some function $y(x)$ defined over the input space in the form of
$$y(x; w) = \sum_{i=1}^{M} w_i \phi_i(x) = w^T \phi(x) \qquad (1)$$

as a linearly weighted sum of $M$ (generally nonlinear and fixed) basis functions $\phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_M(x))^T$. Although this model is linear in the parameters (or weights) $w = (w_1, w_2, \ldots, w_M)^T$, it can still be highly flexible as the size of the basis set $M$ can be effectively large.
Learning is basically the process of inferring the function or, equivalently, the parameters of the function $y(x)$. In this context, it is desired to estimate reasonable values for the parameters (or weights) $w = (w_1, w_2, \ldots, w_M)^T$. Given a set of $N$ corresponding training pairs $\{x_n, t_n\}_{n=1}^{N}$, the objective is to find values for the weights $w$ such that $y(x)$ generalizes well enough to new data, yet only a few elements of $w$ are nonzero [15]. Having only a few nonzero weights facilitates a sparse representation with the advantage of providing fast implementation.
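To make the linearly weighted model in (1) concrete, the following minimal Python sketch (an illustration added in this rewrite, not code from the letter) evaluates $y(x; w) = w^T \phi(x)$ using Gaussian basis functions centered on the training samples; the function names and the basis width are assumptions.

```python
import numpy as np

def rbf_basis(X, centers, gamma=2.0):
    # Design matrix Phi with one Gaussian basis function per center:
    # Phi[n, i] = exp(-gamma * ||X[n] - centers[i]||^2)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def predict(X, centers, w, gamma=2.0):
    # y(x; w) = w^T phi(x), evaluated for every row of X.
    return rbf_basis(X, centers, gamma) @ w

# Toy usage: 5 samples with 200 spectral bands and arbitrary weights.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 200))
w = rng.normal(size=5)
print(predict(X_train, X_train, w))
```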


The SVM [1]-[5] provides a successful approach to supervised learning by making predictions based on a function in the form of

$$y(x; w) = \sum_{i=1}^{N} w_i K(x, x_i) + w_0 \qquad (2)$$

where $w_i$ denotes the model weights, and $K(\cdot, \cdot)$ is a kernel function, effectively defining one basis function for each sample in the training set. The key feature of SVM classification is that its target function attempts to minimize a measure of error on the training set while simultaneously maximizing the margin between the two classes that are implicitly defined in the feature space by the kernel $K$ [2]. This process results in a sparse model that depends only on a subset of kernel functions, namely, those associated with training examples that lie either on the margin or on the wrong side of it; the corresponding training examples are referred to as SVs. The SVM is quite popular in supervised learning applications and has recently been applied for regression and classification of multispectral [6], [7] as well as hyperspectral images [8]-[12]; therefore, the reader is referred to these references for the basics of the SVM. Although SVM classification provides successful results, a number of significant and practical disadvantages have been identified [15], [16], as follows.
1) Although SVMs are relatively sparse, the number of SVs typically grows linearly with the size of the training set, and therefore, SVMs make unnecessarily liberal use of basis functions.
2) Predictions are not probabilistic, and therefore, the SVM is not suitable for classification tasks in which posterior probabilities of class membership are necessary.
3) In the SVM, it is required to estimate the error/margin tradeoff parameter $C$, which generally entails a cross-validation procedure that can be a waste of data as well as computation.
4) In the SVM, the kernel function must satisfy Mercer's condition; hence, it must be a continuous symmetric kernel of a positive integral operator.
The RVM has been introduced by Tipping [15], [16] as a
Bayesian treatment alternative to the SVM that does not suffer
from the aforementioned limitations. The RVM introduces a
prior over the model weights governed by a set of hyperparameters, in a probabilistic framework. One hyperparameter
is associated with each weight, and the most probable values
are iteratively estimated from the training data. The most
compelling feature of the RVM is that it typically utilizes
significantly fewer kernel functions compared to the SVM,
while providing a similar performance.
For two-class classification, any target can be classified into one of two classes such that $t_n \in \{0, 1\}$. A Bernoulli distribution can be adopted for $p(t|x)$ in the probabilistic framework because only two values (0 and 1) are possible. The logistic sigmoid link function $\sigma(y) = 1/(1 + e^{-y})$ is applied to $y(x)$ to link the random and systematic components and generalize the linear model. Following the definition of the Bernoulli distribution, the likelihood is written as

$$p(t|w) = \prod_{n=1}^{N} \sigma\{y(x_n; w)\}^{t_n}\left[1 - \sigma\{y(x_n; w)\}\right]^{1 - t_n} \qquad (3)$$

for the targets $t_n \in \{0, 1\}$.
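A short Python sketch of the sigmoid link and the Bernoulli log-likelihood in (3) follows; it is an illustrative addition with hypothetical names, not the authors' code.

```python
import numpy as np

def sigmoid(y):
    # Logistic link sigma(y) = 1 / (1 + exp(-y))
    return 1.0 / (1.0 + np.exp(-y))

def bernoulli_log_likelihood(Phi, w, t, eps=1e-12):
    # log p(t|w) = sum_n [ t_n log y_n + (1 - t_n) log(1 - y_n) ],
    # with y_n = sigma(w^T phi(x_n)) and Phi the design matrix.
    y = np.clip(sigmoid(Phi @ w), eps, 1.0 - eps)  # numerical safety
    return np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```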


The likelihood is complemented by a prior over the parameters (weights) in the form of

$$p(w|\alpha) = \prod_{i=1}^{N} \sqrt{\frac{\alpha_i}{2\pi}}\, \exp\!\left(-\frac{\alpha_i w_i^2}{2}\right) \qquad (4)$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)^T$ denotes the hyperparameters introduced to control the strength of the prior over the associated weights. Hence, the prior is Gaussian, but conditioned on $\alpha$. For a certain $\alpha$ value, the posterior weight distribution conditioned on the data can be obtained using Bayes' rule, i.e.,

$$p(w|t, \alpha) = \frac{p(t|w)\, p(w|\alpha)}{p(t|\alpha)} \qquad (5)$$

where $p(t|w)$ is the likelihood, $p(w|\alpha)$ is the prior, and $p(t|\alpha)$ is referred to as the evidence.
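For concreteness, the zero-mean Gaussian prior in (4) can be evaluated in log form as in the short sketch below (an illustrative addition of this rewrite, not from the letter):

```python
import numpy as np

def log_gaussian_prior(w, alpha):
    # log p(w|alpha) = sum_i [ 0.5*log(alpha_i/(2*pi)) - 0.5*alpha_i*w_i^2 ]
    return np.sum(0.5 * np.log(alpha / (2.0 * np.pi)) - 0.5 * alpha * w ** 2)
```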
The weights cannot be obtained analytically, and therefore, a Laplace approximation procedure [18] is used.

1) Since $p(w|t, \alpha)$ is proportional to $p(t|w)\,p(w|\alpha)$, it is possible to aim to find the maximum of

$$\log\{p(t|w)\,p(w|\alpha)\} = \sum_{n=1}^{N}\left[t_n \log y_n + (1 - t_n)\log(1 - y_n)\right] - \frac{1}{2} w^T A w \qquad (6)$$

for the most probable weights $w_{\mathrm{MP}}$, with $y_n = \sigma\{y(x_n; w)\}$ and $A = \mathrm{diag}(\alpha_0, \alpha_1, \ldots, \alpha_N)$ being composed of the current values of $\alpha$. This is a penalized logistic log-likelihood function and requires iterative maximization. The iteratively reweighted least-squares algorithm [15], [19] can be used to find $w_{\mathrm{MP}}$.

2) The logistic log-likelihood function can be differentiated twice to obtain the Hessian in the form of

$$\nabla_w \nabla_w \log p(w|t, \alpha)\big|_{w_{\mathrm{MP}}} = -(\Phi^T B \Phi + A) \qquad (7)$$

where $B = \mathrm{diag}(\beta_1, \beta_2, \ldots, \beta_N)$ is a diagonal matrix with $\beta_n = \sigma\{y(x_n; w_{\mathrm{MP}})\}\left[1 - \sigma\{y(x_n; w_{\mathrm{MP}})\}\right]$, and $\Phi$ is the design matrix with $\Phi_{nm} = K(x_n, x_{m-1})$ and $\Phi_{n1} = 1$. This result is then negated and inverted to give the covariance $\Sigma$, as follows, for a Gaussian approximation to the posterior over weights centered at $w_{\mathrm{MP}}$:

$$\Sigma = (\Phi^T B \Phi + A)^{-1}. \qquad (8)$$

In this way, the classification problem is effectively linearized locally around $w_{\mathrm{MP}}$ with

$$w_{\mathrm{MP}} = \Sigma \Phi^T B \hat{t} \qquad (9)$$

$$\hat{t} = \Phi w_{\mathrm{MP}} + B^{-1}(t - y). \qquad (10)$$


[Table I: Number of training and test samples]

[Table II: Classification AC and number of SVs for SVM]

[Table III: Classification AC and number of RVs for RVM]

These equations are basically equivalent to the solution of a generalized least-squares problem. After obtaining $w_{\mathrm{MP}}$, the hyperparameters $\alpha_i$ are updated using $\alpha_i^{\mathrm{new}} = \gamma_i / w_i^2$, where $w_i$ is the $i$th posterior mean weight, and $\gamma_i$ is defined as $\gamma_i = 1 - \alpha_i \Sigma_{ii}$, where $\Sigma_{ii}$ is the $i$th diagonal element of the covariance; $\gamma_i$ can be regarded as a measure of how well determined each parameter $w_i$ is by the data. During the optimization process, many $\alpha_i$ will take large values, and thus, the corresponding model weights are pruned out, realizing sparsity. The optimization process typically continues until the maximum change in the $\alpha_i$ values is below a certain threshold or the maximum number of iterations is reached.
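Putting (6)-(10) and the hyperparameter update together, a compact Python sketch of two-class RVM training is given below. It is a simplified reconstruction under stated assumptions (fixed design matrix, naive matrix inversion, crude convergence and pruning thresholds), added for illustration; it is not the authors' implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rvm_binary_train(Phi, t, n_outer=200, tol=1e-3, prune_alpha=1e9):
    # Phi: N x M design matrix (first column all ones for the bias),
    # t: targets in {0, 1}. Returns sparse weights and surviving indices.
    N, M = Phi.shape
    alpha = np.full(M, 1e-6)                 # one hyperparameter per weight
    w = np.zeros(M)
    keep = np.arange(M)                      # indices of active basis functions
    for _ in range(n_outer):
        P, a, wk = Phi[:, keep], alpha[keep], w[keep]
        # Inner IRLS (Newton) loop: penalized logistic fit for w_MP, eq. (6).
        for _ in range(25):
            y = sigmoid(P @ wk)
            B = y * (1.0 - y)                # diagonal entries of B
            H = (P.T * B) @ P + np.diag(a)   # Phi^T B Phi + A
            g = P.T @ (t - y) - a * wk       # gradient of the penalized objective
            wk = wk + np.linalg.solve(H, g)
        Sigma = np.linalg.inv(H)                           # eq. (8)
        gamma = np.clip(1.0 - a * np.diag(Sigma), 1e-12, None)
        a_new = gamma / (wk ** 2 + 1e-12)                  # alpha_i_new = gamma_i / w_i^2
        delta = np.max(np.abs(np.log(a_new) - np.log(a)))
        alpha[keep], w[keep] = a_new, wk
        keep = keep[alpha[keep] < prune_alpha]             # prune large-alpha weights
        if delta < tol:
            break
    return w, keep

# Prediction for new samples: y = sigmoid(Phi_new[:, keep] @ w[keep]).
```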
III. EXPERIMENTAL RESULTS
RVM and SVM classification methods have been applied to a sample hyperspectral image taken over northwest Indiana's Indian Pine test site in June 1992 [20], as the ground truth classification result of this image is already available. The data consist of 145 × 145 pixels with 220 bands. The number of spectral bands is initially reduced to 200 by removing bands covering water absorption as well as noisy bands. The original ground truth actually has 16 classes, but some classes have a very small number of elements; therefore, the nine classes that have the highest number of elements have been selected and used to generate 4757 training samples and 4588 test samples, which are shown in Table I.
The most popular kernels used in the SVM and RVM are the linear, polynomial, and radial basis function (RBF) kernels. The linear kernel typically shows a lower performance and is therefore not employed in the provided results. Note that $\gamma$ serves as an inner product coefficient for the polynomial kernel, whereas it determines the RBF width in the case of the RBF kernel.

Linear kernel:

$$K(x_i, x_j) = x_i \cdot x_j. \qquad (11)$$

Polynomial kernel:

$$K(x_i, x_j) = (\gamma\, x_i \cdot x_j)^d. \qquad (12)$$

RBF kernel:

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right). \qquad (13)$$
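A minimal Python sketch of these three kernels, written for this rewrite under the assumption of row-wise sample matrices (not code from the letter):

```python
import numpy as np

def linear_kernel(Xi, Xj):
    # K(x_i, x_j) = x_i . x_j, computed for all pairs of rows.
    return Xi @ Xj.T

def polynomial_kernel(Xi, Xj, gamma=2.0, d=2):
    # K(x_i, x_j) = (gamma * x_i . x_j)^d
    return (gamma * (Xi @ Xj.T)) ** d

def rbf_kernel(Xi, Xj, gamma=2.0):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    sq = ((Xi[:, None, :] - Xj[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)
```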

The SVM is intrinsically biclass [21], and extending it to multiclass problems is an ongoing research issue [22], [23]. Because it is computationally more expensive to directly solve multiclass problems [3], it is common to combine several binary SVM classifiers for this purpose. In the one-against-all method [4], each class is trained against the remaining $K - 1$ classes, where $K$ is the total number of classes. The one-against-all method has to test $K$ binary decision functions to predict a sample data point. In the one-against-one method [5], $K(K - 1)/2$ binary classifiers are trained, and $K(K - 1)/2$ binary tests are required to make a final decision. Each outcome gives one vote to the winning class, and the class with the most votes is selected as the final result. In this letter, we have applied the second approach for the SVM as well as the RVM because it typically provides faster training. Although the RVM is theoretically not limited to binary classifiers, this is of little use in practice, since the size of the Hessian matrix (used while maximizing the likelihood and updating the weights) grows with the number of classes [24].
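The one-against-one voting scheme described above can be sketched as follows in Python; the binary classifiers are abstracted as callables returning 0/1 decisions, and all names are hypothetical illustrations rather than the authors' code.

```python
import numpy as np
from itertools import combinations

def one_vs_one_predict(x, binary_classifiers, num_classes):
    # binary_classifiers maps a class pair (a, b) with a < b to a function
    # f(x) -> 0 or 1, where 0 votes for class a and 1 votes for class b.
    votes = np.zeros(num_classes, dtype=int)
    for (a, b) in combinations(range(num_classes), 2):   # K(K-1)/2 tests
        winner = a if binary_classifiers[(a, b)](x) == 0 else b
        votes[winner] += 1
    return int(np.argmax(votes))       # class with the most votes wins
```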
Table II shows results for the SVM-based classification (with LIBSVM, which uses sequential minimal optimization [25]), and Table III shows results for the RVM-based classification of the hyperspectral test image. It is seen from these results that, for similar classification AC, the RVM requires a significantly smaller number of RVs; hence, the classification time is considerably reduced. Comparing the maximum classification AC, it is seen that the RVM provides a slightly lower (about 2%) maximum classification AC compared to the SVM; the main reason is possibly that the RVM classifier uses a significantly lower number of RVs compared with the number of SVs used in the SVM, resulting in a sparser representation, which is analogous to results presented in the literature. Note that cross validation is used to obtain the parameters for the SVM; also, results are given for a variety of different parameter combinations. The same parameters are used for the RVM to evaluate its performance.
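As an illustration of the cross-validation step mentioned above (the letter uses LIBSVM directly; the scikit-learn wrapper, the parameter grid, and the dummy data below are assumptions of this rewrite), parameters such as the RBF width gamma and the tradeoff C could be selected as follows:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Dummy stand-ins for the AVIRIS training spectra and labels (placeholders only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 200))          # 200 samples x 200 bands
y_train = rng.integers(0, 9, size=200)         # 9 classes

param_grid = {"C": [1, 40, 1000], "gamma": [0.5, 1, 2, 4]}   # example grid only
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross validation
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```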


[Table IV: Classification results for different training data size]

[Fig. 1: Number of SVs for SVM classification and number of RVs for RVM classification with respect to the number of features.]

[Fig. 2: SVM and RVM classification ACs with respect to the number of features.]

To compare the performance of RVM and SVM classification with different numbers of features, the steepest ascent algorithm [9], [26] is used to reduce the number of features (i.e., the bands used in classification). Fig. 1 shows the number of SVs for the SVM classification and RVs for the RVM classification, and Fig. 2 shows the RVM and SVM classification ACs for different numbers of features. It is seen that the classification AC of the RVM and SVM shows a similar behavior under feature reduction, whereas the RVM always results in a smaller number of RVs compared with the number of SVs obtained in the SVM. Note that $\gamma = 2$ and $C = 1000$ are used for these results.

For comparative evaluation, the one-norm SVM, as presented in [27], is also utilized for hyperspectral classification, as it is known to provide a rather simple and sparse classification method; a classification AC of 78.81% is obtained. This value is significantly lower than the accuracies obtained with the SVM and RVM.
Experimental results show that the RVM is superior to the SVM in terms of the number of kernel functions that need to be used in the classification (i.e., testing) stage. Therefore, the RVM is preferable to the SVM in applications that require low complexity and, possibly, real-time classification with a priori training.

Although the RVM is certainly preferable to the SVM in terms of time performance in the test (classification) phase, it has to be noted that the training time of the RVM is longer than that of the SVM because the update rules for the hyperparameters depend on computing the posterior weight covariance matrix. It is shown in [17] that the training time of the RVM is about seven to eight times longer than that of the SVM, whereas the testing time of the RVM is about seven to eight times shorter than that of the SVM. It is, however, noted in [16] that the increased training time of the RVM is significantly offset by the lack of necessity to perform cross validation over nuisance parameters. It must also be noted that the training complexity of the RVM is due to the necessity of repeatedly computing and inverting the Hessian matrix, which for a set of $N$ samples requires $O(N^2)$ storage and $O(N^3)$ computation. For large data sets, this makes training considerably slower than for the SVM [15].

Table IV shows the effect of the training set size on the SVM and RVM classifications for $\gamma = 2$ and $C = 40$. In this case, classification is carried out with five differently chosen training sets of the given size, and the median result is presented to avoid bias. It is seen that the RVM always results in a significantly lower number of RVs compared with the number of SVs in the SVM, at the cost of a slightly lower classification AC. Note that the number of SVs and RVs always shows the total number used in the one-against-one classifications, with common vectors counted separately.

IV. CONCLUSION
RVM-based hyperspectral image classification is presented in this letter. It is shown to provide similar classification AC, with a significantly smaller RV rate and, therefore, much faster testing time, compared with SVM-based classification. Hence, RVM classification is superior to SVM classification in terms of sparsity. This makes the RVM-based hyperspectral classification approach more suitable for applications that require low complexity and, possibly, real-time classification. However, the classification AC yielded by the RVM is slightly lower than that of the SVM, particularly for reduced training sample sets.

ACKNOWLEDGMENT
The authors would like to thank R. Johansson for his help
with the RVM, D. Landgrebe for providing the AVIRIS data
[20], C.-J. Lin for the LIBSVM software [25], and the anonymous reviewers whose comments have significantly improved
this letter.
REFERENCES
[1] B. E. Boser, I. M. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th Annu. ACM Workshop Comput. Learn. Theory, 1992, pp. 144-152.
[2] C. Burges, "A tutorial on support vector machines for pattern recognition," in Proc. Data Mining and Knowl. Discovery, U. Fayyad, Ed., 1998, pp. 1-43.
[3] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415-425, Mar. 2002.
[4] L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, L. Jackel, Y. LeCun, U. Muller, E. Sackinger, P. Simard, and V. Vapnik, "Comparison of classifier methods: A case study in handwriting digit recognition," in Proc. Int. Conf. Pattern Recog., 1994, pp. 77-87.
[5] S. Knerr, L. Personnaz, and G. Dreyfus, "Single-layer learning revisited: A stepwise procedure for building and training a neural network," in Neurocomputing: Algorithms, Architectures and Applications, J. Fogelman, Ed. New York: Springer-Verlag, 1990.
[6] C. Huang, L. S. Davis, and J. R. G. Townshend, "An assessment of support vector machines for land cover classification," Int. J. Remote Sens., vol. 23, no. 4, pp. 725-749, Feb. 2002.
[7] F. Roli and G. Fumera, "Support vector machines for remote-sensing image classification," Proc. SPIE, vol. 4170, pp. 160-166, 2001.
[8] J. A. Gualtieri, S. R. Chettri, R. F. Cromp, and L. F. Johnson, "Support vector machine classifiers as applied to AVIRIS data," in Proc. Summaries 8th JPL Airborne Earth Sci. Workshop, 1999, pp. 217-227, JPL Pub. 99-17.
[9] F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778-1790, Aug. 2004.
[10] Y. Bazi and F. Melgani, "Toward an optimal SVM classification system for hyperspectral remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 11, pp. 3374-3385, Nov. 2006.
[11] G. Camps-Valls and L. Bruzzone, "Kernel-based methods for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 6, pp. 1352-1362, Jun. 2005.
[12] W. M. Lennon, G. Mercier, and L. Hubert-Moy, "Classification of hyperspectral images with nonlinear filtering and support vector machines," in Proc. IGARSS, 2002, vol. 3, pp. 1670-1672.
[13] A. A. Farag, R. M. Mohamed, and A. El-Baz, "A unified framework for MAP estimation in remote sensing image segmentation," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 7, pp. 1617-1634, Jul. 2005.
[14] P. Mantero, G. Moser, and S. B. Serpico, "Partially supervised classification of remote sensing images through SVM-based probability density estimation," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 559-570, Mar. 2005.
[15] M. E. Tipping, "The relevance vector machine," in Advances in Neural Information Processing Systems, vol. 12, S. A. Solla, T. K. Leen, and K.-R. Müller, Eds. Cambridge, MA: MIT Press, 2000.
[16] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Mach. Learn. Res., vol. 1, pp. 211-244, 2001.
[17] W. Liyang, Y. Yongyi, R. M. Nishikawa, M. N. Wernick, and A. Edwards, "Relevance vector machine for automatic detection of clustered microcalcifications," IEEE Trans. Med. Imag., vol. 24, no. 10, pp. 1278-1285, Oct. 2005.
[18] D. J. C. MacKay, "The evidence framework applied to classification networks," Neural Comput., vol. 4, no. 5, pp. 720-736, 1992.
[19] I. T. Nabney, "Efficient training of RBF networks for classification," in Proc. 9th ICANN, 1999, vol. 1, pp. 210-215.
[20] AVIRIS NW Indiana's Indian Pines 1992 Data Set (original files and ground truth). [Online]. Available: ftp://ftp.ecn.purdue.edu/biehl/MultiSpec/92AV3C
[21] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273-297, Sep. 1995.
[22] D. Anguita, S. Ridella, and D. Sterpi, "A new method for multiclass support vector machines," in Proc. IEEE Int. Joint Conf. Neural Netw., Jul. 2004, vol. 1, pp. 407-412.
[23] R. Rifkin and A. Klautau, "In defense of one-versus-all classification," J. Mach. Learn. Res., vol. 5, pp. 101-141, 2004.
[24] R. Johansson and P. Nugues, "Sparse Bayesian classification of predicate arguments," in Proc. 9th Conf. Comput. Natural Language Learn., 43rd Annu. Meeting Assoc. Comput. Linguistics, Ann Arbor, MI, 2005, pp. 177-200.
[25] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2001. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[26] S. Serpico and L. Bruzzone, "A new search algorithm for feature selection in hyperspectral remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 39, no. 7, pp. 1360-1367, Jul. 2001.
[27] F. Glen and O. Mangasarian, "A feature selection Newton method for support vector machine classification," Comput. Optim. Appl., vol. 28, no. 2, pp. 185-202, Jul. 2004.
