TIME SERIES FORECASTING
BY USING
WAVELET KERNEL SUPPORT VECTOR MACHINES
LSE Time Series Reading Group
By Ali Habibnia
([email protected])
31 Oct 2012
Outline
Introduction to Statistical Learning and SVM
SVM & SVR Formulation
Wavelet as a Kernel Function
Study 1: Forecasting volatility based on wavelet support vector machine, by Ling-Bing Tang, Ling-Xiao Tang, Huan-Ye Sheng
Study 2: Forecasting Volatility in Financial Markets by Introducing a GA-Assisted SVR-GARCH Model, by Ali Habibnia
Suggestion for further research + Q&A
SVMs History
The study of Statistical Learning Theory was started in the 1960s by Vladimir Vapnik, who is well known as a founder of this theory (together with Professor Alexey Chervonenkis).
He also developed the theory of support vector machines (for linear and nonlinear input-output knowledge discovery) within the framework of statistical learning theory in 1992.
Prof. Vapnik was awarded the 2012 Benjamin Franklin Medal in Computer and Cognitive Science by the Franklin Institute.
History and motivation
SVMs (a novel type of ANN) are a non-parametric supervised learning algorithm for:
Pattern Recognition
Regression Estimation (applications to function estimation started around 1995, under the name Support Vector Regression)
Remarkable characteristics of SVMs
Good generalization performance: SVMs implement the Structural Risk Minimization Principle
which seeks to minimize the upper bound of the generalization error rather than only minimize
the training error.
Absence of local minima: training an SVM is equivalent to solving a linearly constrained
quadratic programming problem. Hence the solution of SVMs is unique and globally optimal.
It has a simple geometrical interpretation in a high-dimensional feature space
that is nonlinearly related to input space
The Advantages of SVM(R)
Based on a strong and nice Theory:
In contrast to previous black box learning approaches, SVMs allow for some intuition
and human understanding.
Training is relatively easy:
No local optima, unlike in neural networks
Training time does not depend on the dimensionality of the feature space, only on the fixed
input space, thanks to the kernel trick.
Generally avoids over-fitting:
Trade-off between complexity and error can be controlled explicitly.
Generalizes well even in high-dimensional spaces with small training sets, and is robust
to noise.
Linear Classifiers
g(x) is a linear function:
$g(x) = w^T x + b$
This defines a hyperplane in the feature space, separating the region where $w^T x + b > 0$ from the region where $w^T x + b < 0$.
The (unit-length) normal vector of the hyperplane is $n = \dfrac{w}{\|w\|}$.
[Figure: a separating line in the $(x_1, x_2)$ plane.]
Linear Classifiers
How would you classify these points (class +1 vs. class -1) using a linear discriminant function in order to minimize the error rate?
There are an infinite number of answers!
[Figure: two classes of points in the $(x_1, x_2)$ plane with several candidate separating lines.]
Linear Classifiers
Which one is the best?
[Figure: the same two classes with multiple candidate separating lines.]
Large Margin Linear Classifier
The linear discriminant function (classifier) with the maximum margin is the best.
The margin is defined as the width by which the boundary could be increased before hitting a data point.
Why is it the best? It is robust to outliers and thus has strong generalization ability.
[Figure: maximum-margin separating line with its margin ("safe zone") in the $(x_1, x_2)$ plane.]
Large Margin Linear Classifier
Given a set of data points $\{(x_i, y_i)\},\ i = 1, 2, \ldots, n$:
For $y_i = +1$: $w^T x_i + b > 0$
For $y_i = -1$: $w^T x_i + b < 0$
With a scale transformation on both $w$ and $b$, the above is equivalent to:
For $y_i = +1$: $w^T x_i + b \ge 1$
For $y_i = -1$: $w^T x_i + b \le -1$
[Figure: margin ("safe zone") and support vectors in the $(x_1, x_2)$ plane.]
Large Margin Linear Classifier
We know that the margin boundaries satisfy:
$w^T x^+ + b = 1$
$w^T x^- + b = -1$
The margin width is therefore:
$M = (x^+ - x^-) \cdot n = (x^+ - x^-) \cdot \dfrac{w}{\|w\|} = \dfrac{2}{\|w\|}$
[Figure: margin width between the two boundary hyperplanes, with the support vectors marked.]
Large Margin Linear Classifier
Formulation:
maximize $\dfrac{2}{\|w\|}$
such that:
For $y_i = +1$: $w^T x_i + b \ge 1$
For $y_i = -1$: $w^T x_i + b \le -1$
[Figure: margin and support vectors.]
This is the simplest kind of SVM (Called
an LSVM)
Formulation:
minimize $\dfrac{1}{2}\|w\|^2$
such that:
For $y_i = +1$: $w^T x_i + b \ge 1$
For $y_i = -1$: $w^T x_i + b \le -1$
or, equivalently, such that:
$y_i (w^T x_i + b) \ge 1$
[Figure: margin and support vectors.]
The Optimization Problem Solution
The primal problem is a quadratic program with linear constraints:
minimize $\dfrac{1}{2}\|w\|^2$
s.t. $y_i (w^T x_i + b) \ge 1$
Introducing the Lagrangian function:
minimize $L_p(w, b, \alpha_i) = \dfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \big( y_i (w^T x_i + b) - 1 \big)$
s.t. $\alpha_i \ge 0$
The Optimization Problem Solution
minimize $L_p(w, b, \alpha_i) = \dfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \big( y_i (w^T x_i + b) - 1 \big)$
s.t. $\alpha_i \ge 0$
Setting the derivatives to zero:
$\dfrac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i$
$\dfrac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$
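Since training is a linearly constrained quadratic program, the hard-margin primal can be handed directly to a general QP solver. Below is a minimal sketch, not from the slides, using the cvxpy library on toy, linearly separable data; the data generation, library choice, and variable names are my own illustration. In practice the dual problem is solved instead, which is what dedicated SVM packages do.

import numpy as np
import cvxpy as cp

# toy, linearly separable data: two Gaussian blobs well apart (placeholder data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(+2.0, 0.5, size=(20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

w = cp.Variable(2)
b = cp.Variable()
# primal hard-margin SVM: minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print("w =", w.value, "b =", b.value)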
The Optimization Problem Solution
From the KKT condition we know:
$\alpha_i \big( y_i (w^T x_i + b) - 1 \big) = 0$
Thus, only support vectors have $\alpha_i \ne 0$.
The solution has the form:
$w = \sum_{i=1}^{n} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i$
Get $b$ from $y_i (w^T x_i + b) - 1 = 0$, where $x_i$ is any support vector.
[Figure: support vectors $x^+$ and $x^-$ on the margin boundaries in the $(x_1, x_2)$ plane.]
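A quick way to see that only the support vectors carry the solution is to inspect a fitted classifier. This sketch assumes scikit-learn's SVC (not mentioned in the slides) and synthetic data; the attribute names follow scikit-learn's API.

import numpy as np
from sklearn.svm import SVC

# synthetic two-class data (placeholder)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel='linear', C=1.0).fit(X, y)
# only support vectors enter w = sum_i alpha_i y_i x_i
print(clf.support_vectors_.shape)   # the x_i with alpha_i > 0
print(clf.dual_coef_)               # the products alpha_i * y_i for those points
print(clf.coef_, clf.intercept_)    # recovered w and b (available for linear kernels)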
Soft Margin Classification
Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples.
What should our quadratic optimization criterion be? Minimize:
$\dfrac{1}{2}\, w \cdot w + C \sum_{k=1}^{R} \xi_k$
[Figure: misclassified points with slack variables $\xi_1$, $\xi_2$.]
Hard Margin vs. Soft Margin
The old formulation:
Find $w$ and $b$ such that $\Phi(w) = \dfrac{1}{2} w^T w$ is minimized, and for all $\{(x_i, y_i)\}$: $y_i (w^T x_i + b) \ge 1$
The new formulation, incorporating slack variables:
Find $w$ and $b$ such that $\Phi(w) = \dfrac{1}{2} w^T w + C \sum_i \xi_i$ is minimized, and for all $\{(x_i, y_i)\}$: $y_i (w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$
The parameter $C$ can be viewed as a way to control overfitting.
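To illustrate the role of $C$, here is a small sketch, again with scikit-learn on synthetic, overlapping classes (all parameter values are arbitrary choices of mine): it fits a soft-margin SVM for several values of $C$ and counts the support vectors.

import numpy as np
from sklearn.svm import SVC

# overlapping classes so that some slack is unavoidable (placeholder data)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# small C tolerates margin violations (soft margin); large C approaches the hard margin
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, int(clf.n_support_.sum()))   # number of support vectors typically shrinks as C grows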
Non-linear SVMs: Feature spaces
The general idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
$\Phi: x \mapsto \Phi(x)$
The Kernel Trick
The linear classifier relies on the dot product between vectors: $K(x_i, x_j) = x_i^T x_j$.
If every data point is mapped into a high-dimensional space via some transformation $\Phi: x \mapsto \Phi(x)$, the dot product becomes: $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$.
A kernel function is a function that corresponds to an inner product in some expanded feature space.
Kernel methods map the data into higher-dimensional spaces in the hope that in this higher-dimensional space the data become more easily separable or better structured.
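A tiny numerical check of this equivalence, for the homogeneous degree-2 polynomial kernel in two dimensions (my own example, not from the slides): the kernel value computed in the input space equals the dot product of the explicitly mapped feature vectors.

import numpy as np

def phi(x):
    # explicit feature map of the degree-2 homogeneous polynomial kernel in 2-D:
    # K(x, z) = (x^T z)^2  <=>  phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((x @ z) ** 2)        # kernel evaluated in the input space -> 1.0
print(phi(x) @ phi(z))     # dot product in the expanded feature space -> 1.0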
The Kernel Trick
This mapping function, however, hardly ever needs to be computed explicitly, because of a tool called the kernel trick.
The kernel trick is a mathematical tool that can be applied to any algorithm which depends solely on the dot product between two vectors. Wherever a dot product is used, it is replaced by a kernel function.
Some commonly used kernels:
1. Linear Kernel
2. Polynomial Kernel
3. Gaussian Kernel
4. Exponential Kernel
5. Laplacian Kernel
6. ANOVA Kernel
7. Hyperbolic Tangent (Sigmoid) Kernel
8. Rational Quadratic Kernel
9. Multiquadric Kernel
10. Inverse Multiquadric Kernel
11. Circular Kernel
12. Spherical Kernel
13. Wave Kernel
14. Power Kernel
15. Log Kernel
16. Spline Kernel
17. B-Spline Kernel
18. Bessel Kernel
19. Cauchy Kernel
20. Chi-Square Kernel
21. Histogram Intersection Kernel
22. Generalized Histogram Intersection Kernel
23. Generalized T-Student Kernel
24. Bayesian Kernel
25. Wavelet Kernel
What Functions are Kernels?
For some functions $K(x_i, x_j)$, checking that $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$ can be cumbersome.
Mercer's theorem: every positive semi-definite symmetric function is a kernel.
Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:
$K = \begin{pmatrix} K(x_1,x_1) & K(x_1,x_2) & K(x_1,x_3) & \cdots & K(x_1,x_N) \\ K(x_2,x_1) & K(x_2,x_2) & K(x_2,x_3) & \cdots & K(x_2,x_N) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ K(x_N,x_1) & K(x_N,x_2) & K(x_N,x_3) & \cdots & K(x_N,x_N) \end{pmatrix}$
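The Mercer condition can be checked numerically on a sample: build the Gram matrix of a candidate kernel and confirm it is symmetric with non-negative eigenvalues. A small sketch with a Gaussian kernel on random data (my own construction, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # arbitrary sample points

# Gram matrix of the Gaussian kernel with sigma = 1 (placeholder bandwidth)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / 2.0)

# a Mercer kernel must produce a symmetric, positive semi-definite Gram matrix
eigvals = np.linalg.eigvalsh(K)
print(np.allclose(K, K.T), eigvals.min() >= -1e-10)   # expect: True True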
Examples of Kernel Functions
Linear kernel: $K(x_i, x_j) = x_i^T x_j$
Polynomial kernel: $K(x_i, x_j) = (1 + x_i^T x_j)^p$
Gaussian (Radial Basis Function, RBF) kernel: $K(x_i, x_j) = \exp\!\left( -\dfrac{\|x_i - x_j\|^2}{2\sigma^2} \right)$
Sigmoid kernel: $K(x_i, x_j) = \tanh(\beta_0\, x_i^T x_j + \beta_1)$
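For concreteness, here are direct NumPy implementations of these four kernels (a sketch; the parameter names p, sigma, beta0, beta1 are my own labels for the symbols above):

import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=3):
    return (1.0 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    # note: the sigmoid kernel is not positive semi-definite for all parameter values
    return np.tanh(beta0 * (xi @ xj) + beta1)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), gaussian_kernel(x, z), sigmoid_kernel(x, z))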
Nonlinear SVM: Optimization
Some Issues
Choice of kernel:
- A Gaussian or polynomial kernel is the default
- If ineffective, more elaborate kernels are needed
Choice of kernel parameters:
- e.g. $\sigma$ in the Gaussian kernel; $\sigma$ is roughly the distance between the closest points with different classifications
- In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters (a grid-search sketch follows below)
Optimization criterion, hard margin vs. soft margin:
- typically a lengthy series of experiments in which various parameters are tested
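A common way to implement the cross-validation advice above is a grid search. The sketch below assumes scikit-learn's GridSearchCV and an RBF-kernel SVR on synthetic data; the parameter grids are arbitrary examples of mine, not recommendations from the slides.

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# synthetic 1-D regression problem (placeholder for real data)
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

# cross-validated grid search over C, gamma and epsilon
grid = GridSearchCV(SVR(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10, 100],
                                'gamma': [0.01, 0.1, 1, 10],
                                'epsilon': [0.01, 0.1, 0.5]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)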
Support vector regression
The maximum-margin hyperplane only applies to classification.
However, the ideas of support vectors and kernel functions can also be used for regression.
The basic method is the same as in linear regression: minimize error.
Difference A: ignore errors smaller than $\varepsilon$ and use absolute error instead of squared error (see the small sketch below).
Difference B: simultaneously aim to maximize the flatness of the function.
The user-specified parameter $\varepsilon$ defines a tube around the regression function.
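Difference A is the $\varepsilon$-insensitive loss. A minimal NumPy sketch of that loss (my own illustration, with an arbitrary $\varepsilon$):

import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    # errors inside the epsilon tube cost nothing; outside the tube the cost
    # grows with the absolute (not squared) distance to the tube
    return np.maximum(np.abs(y_true - y_pred) - epsilon, 0.0)

print(epsilon_insensitive_loss(np.array([1.0, 1.0, 1.0]),
                               np.array([1.05, 1.3, 0.5])))   # -> [0.  0.2 0.4]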
Ordinary Least Squares (OLS)
$f(x) = wx + b$
$\text{Loss} = \sum \big( (wX + b) - Y \big)^2$
Solution:
$\dfrac{d\,\text{Loss}}{dw} = 0 \;\Rightarrow\; (X^T X)\, w = X^T Y$
[Figure: a fitted line $f(x)$ through the data points.]
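For comparison with SVR, the OLS solution really is just the normal equations; a small NumPy sketch on synthetic data (my own example):

import numpy as np

# synthetic data from y = 2x + 1 plus noise (placeholder)
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)

X = np.column_stack([x, np.ones_like(x)])    # design matrix [x, 1] so b is absorbed into w
w, b = np.linalg.solve(X.T @ X, X.T @ y)     # normal equations: (X^T X) w = X^T y
print(w, b)                                  # close to 2.0 and 1.0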
Support Vector Regression (SVR)
$f(x) = wx + b$
Minimize: $\dfrac{1}{2} w^T w$
Constraints:
$y_i - w^T x_i - b \le \varepsilon$
$w^T x_i + b - y_i \le \varepsilon$
[Figure: the $\varepsilon$-tube ($+\varepsilon$, $0$, $-\varepsilon$) around the regression line $f(x)$.]
Support Vector Regression (SVR)
$f(x) = wx + b$
Minimize: $\dfrac{1}{2} w^T w + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$
Constraints:
$y_i - w^T x_i - b \le \varepsilon + \xi_i$
$w^T x_i + b - y_i \le \varepsilon + \xi_i^*$
$\xi_i, \xi_i^* \ge 0$
[Figure: points outside the $\varepsilon$-tube incur slack $\xi$, $\xi^*$.]
Lagrange Optimisation
Target:
$L = \dfrac{1}{2} w^T w + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) - \sum_{i=1}^{N} \alpha_i (\varepsilon + \xi_i - y_i + w^T x_i + b) - \sum_{i=1}^{N} \alpha_i^* (\varepsilon + \xi_i^* + y_i - w^T x_i - b) - \sum_{i=1}^{N} (\eta_i \xi_i + \eta_i^* \xi_i^*)$
subject to the constraints above.
Regression function:
$y(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b$
Nonlinear Regression
[Figure: the $\varepsilon$-tube regression in the input space and, after the mapping $\Phi(x)$, in the feature space.]
Regression Formulas
Linear: $y(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b$
Nonlinear: $y(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle \Phi(x_i), \Phi(x) \rangle + b$
General: $y(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, K(x_i, x) + b$
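The general formula can be verified against a fitted model: scikit-learn's SVR exposes the support vectors, the coefficients $(\alpha_i - \alpha_i^*)$ as dual_coef_, and $b$ as intercept_, so the kernel expansion can be evaluated by hand. A sketch under those assumptions (synthetic data and arbitrary hyperparameters of my own):

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

# synthetic 1-D regression data (placeholder)
rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

svr = SVR(kernel='rbf', gamma=0.5, C=10.0, epsilon=0.05).fit(X, y)

# y(x) = sum over support vectors of (alpha_i - alpha_i*) K(x_i, x) + b
X_new = np.array([[0.5], [1.5]])
K = rbf_kernel(X_new, svr.support_vectors_, gamma=0.5)
manual = K @ svr.dual_coef_.ravel() + svr.intercept_
print(manual, svr.predict(X_new))   # the two should agree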
Kernel Types
Linear: $K(x, x_i) = \langle x, x_i \rangle$
Polynomial: $K(x, x_i) = \langle x, x_i \rangle^d$
Radial basis function: $K(x, x_i) = \exp\!\left( -\dfrac{\|x - x_i\|^2}{2\sigma^2} \right)$
Exponential RBF: $K(x, x_i) = \exp\!\left( -\dfrac{\|x - x_i\|}{2\sigma^2} \right)$
Wavelet Kernel
The wavelet kernel (Zhang et al., 2004) comes from wavelet theory and is given as:
$K(x, x') = \prod_{i=1}^{N} h\!\left( \dfrac{x_i - c_i}{a} \right) h\!\left( \dfrac{x'_i - c'_i}{a} \right)$
where $a$ and $c$ are the wavelet dilation and translation coefficients, respectively (the form presented above is a simplification). A translation-invariant version of this kernel can be given as:
$K(x, x') = \prod_{i=1}^{N} h\!\left( \dfrac{x_i - x'_i}{a} \right)$
where in both cases $h(x)$ denotes a mother wavelet function, e.g.:
$h(x) = \cos(1.75\, x)\, \exp\!\left( -\dfrac{x^2}{2} \right)$
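A sketch of the translation-invariant wavelet kernel as a plug-in kernel for SVR, assuming the Morlet-type mother wavelet above and scikit-learn's support for callable kernels; the dilation a and all other hyperparameters here are arbitrary choices of mine, not values from Zhang et al. (2004).

import numpy as np
from sklearn.svm import SVR

def mother_wavelet(u):
    # Morlet-type mother wavelet h(u) = cos(1.75 u) exp(-u^2 / 2)
    return np.cos(1.75 * u) * np.exp(-0.5 * u ** 2)

def wavelet_kernel(A, B, a=1.0):
    # translation-invariant wavelet kernel: K(x, x') = prod_j h((x_j - x'_j) / a)
    diff = (A[:, None, :] - B[None, :, :]) / a      # pairwise coordinate differences
    return np.prod(mother_wavelet(diff), axis=2)    # Gram matrix, shape (len(A), len(B))

# toy 1-D regression data to exercise the kernel (placeholder)
rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

svr = SVR(kernel=lambda A, B: wavelet_kernel(A, B, a=1.0), C=10.0, epsilon=0.05).fit(X, y)
print(svr.predict(np.array([[0.5], [1.5]])))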
A Simple View of Wavelet Theory
Fourier Analysis
Breaks down a signal into constituent sinusoids of different frequencies.
In other words: it transforms the view of the signal from the time domain to the frequency domain.
A Simple View of Wavelet Theory
What's wrong with Fourier?
By using the Fourier transform, we lose the time information: WHEN did a particular event take place?
The FT cannot locate drift, trends, abrupt changes, beginnings and ends of events, etc.
The calculation uses complex numbers.
Wavelets vs. Fourier Transform
In the Fourier transform (FT) we represent a signal in terms of sinusoids.
The FT provides a representation which is localized only in the frequency domain.
It does not give any information about the signal in the time domain.
Wavelets vs. Fourier Transform
The basis functions of the wavelet transform (WT) are small waves located at different times.
They are obtained by scaling and translation of a scaling function and a wavelet function.
Therefore, the WT is localized in both time and frequency.
Wavelets vs. Fourier Transform
If a signal has a discontinuity, the FT produces many coefficients with large magnitude (significant coefficients).
But the WT generates only a few significant coefficients around the discontinuity.
Nonlinear approximation is a method to benchmark the approximation power of a transform.
Wavelets vs. Fourier Transform
In nonlinear approximation we keep only a few significant coefficients of a signal and set the rest to zero.
Then we reconstruct the signal using the significant coefficients.
The WT produces a few significant coefficients for signals with discontinuities.
Thus, we obtain better results with WT nonlinear approximation than with the FT.
Wavelets vs. Fourier Transform
Most natural signals are smooth with a few discontinuities (they are piece-wise smooth).
Speech and natural images are such signals.
Hence, the WT has a better capability for representing these signals than the FT.
Good nonlinear approximation leads to efficiency in several applications such as compression and denoising.
Study 1: Forecasting volatility based on wavelet support vector
machine, Ling-Bing Tang, Ling-Xiao Tang, Huan-Ye Sheng
The paper combines SVM with wavelet theory to construct a multidimensional wavelet kernel function to predict the conditional volatility of stock market returns based on a GARCH model.
A general kernel function in SVM cannot capture the clustering feature of volatility accurately.
The wavelet function yields features that describe the volatility time series both at various locations and at varying time granularities.
The prediction performance of SVM depends greatly on the selection of the kernel function.
In the wavelet kernel, $j$ and $k$ denote the dilation and translation parameters.
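To make the setup concrete, here is a heavily simplified sketch in the spirit of the paper, not the authors' actual code or data: lagged squared returns serve as GARCH-style regressors, and an SVR with the wavelet kernel predicts the next squared return as a volatility proxy. The lag order, hyperparameters, and simulated returns are all my own placeholders.

import numpy as np
from sklearn.svm import SVR

def mother_wavelet(u):
    return np.cos(1.75 * u) * np.exp(-0.5 * u ** 2)

def wavelet_kernel(A, B, a=1.0):
    diff = (A[:, None, :] - B[None, :, :]) / a
    return np.prod(mother_wavelet(diff), axis=2)

# simulated daily returns as a stand-in for the paper's stock-index data
rng = np.random.default_rng(7)
returns = rng.normal(0, 0.01, size=1000)

# GARCH-style regressors: predict today's squared return (a volatility proxy)
# from the previous p squared returns
p = 5
r2 = returns ** 2
X = np.column_stack([r2[i:len(r2) - p + i] for i in range(p)])
y = r2[p:]

svr = SVR(kernel=lambda A, B: wavelet_kernel(A, B, a=1.0), C=1.0, epsilon=1e-5).fit(X, y)
print(svr.predict(X[-1:]))   # one-step-ahead volatility forecast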
Study 1 (cont.): Forecasting volatility based on wavelet support vector machine, Ling-Bing Tang, Ling-Xiao Tang, Huan-Ye Sheng
Study 2: Forecasting Volatility in Financial Markets by Introducing a GA-Assisted SVR-GARCH Model, Ali Habibnia
Since there is no structured way to choose the free parameters of the SVR and of the kernel function, these parameters are usually set by the researcher through a trial-and-error procedure (grid search with cross-validation), which is not optimal.
In this study a novel method, named GA-assisted SVR, is introduced, in which a genetic algorithm simultaneously searches for the SVR's optimal parameters and the kernel parameter (in this study: a radial basis function (RBF) kernel). A rough sketch of this idea follows below.
The SVM(R) tries to get the best fit to the data without relying on any prior knowledge; it concentrates only on minimizing the prediction error for a given machine complexity.
Data: FTSE 100 index from 04 Jan 2005 to 29 Jun 2012.
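A rough sketch of the idea, not the study's actual implementation: a stripped-down evolutionary search (mutation-only, no crossover, so only GA-like) over C, gamma and epsilon of an RBF-kernel SVR, scored by cross-validated error. The population size, ranges, operators, and synthetic data are all placeholders; a full GA would add crossover and a proper encoding.

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# synthetic regression data standing in for the FTSE 100 volatility series
rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

def fitness(params):
    C, gamma, epsilon = params
    svr = SVR(kernel='rbf', C=C, gamma=gamma, epsilon=epsilon)
    # cross-validated negative MSE: higher is better
    return cross_val_score(svr, X, y, cv=5, scoring='neg_mean_squared_error').mean()

# log-uniform initial population over (C, gamma, epsilon); ranges are placeholders
low, high = np.log([1e-2, 1e-3, 1e-3]), np.log([1e3, 1e1, 1e0])
pop = np.exp(rng.uniform(low, high, size=(20, 3)))

for generation in range(15):
    scores = np.array([fitness(p) for p in pop])
    parents = pop[np.argsort(scores)[-10:]]                  # keep the fitter half
    noise = np.exp(rng.normal(0, 0.3, size=(10, 3)))         # multiplicative Gaussian mutation
    children = parents[rng.integers(0, 10, size=10)] * noise
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(p) for p in pop])]
print(dict(zip(['C', 'gamma', 'epsilon'], best)))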
Study 2 (cont.): Forecasting Volatility in Financial Markets by Introducing a GA-Assisted SVR-GARCH Model, Ali Habibnia
Suggestion for further research + Q&A
Thanks for your patience ;)
LSE Time Series Reading Group
By Ali Habibnia
[email protected]
31 Oct 2012