10/13/23, 1:10 PM 1.4. Support Vector Machines — scikit-learn 1.3.1 documentation
1.4. Support Vector Machines
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.
The advantages of support vector machines are:
Effective in high dimensional spaces.
Still effective in cases where number of dimensions is greater than the number of samples.
Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).
The support vector machines in scikit-learn support both dense ( numpy.ndarray and anything convertible to that by numpy.asarray ) and sparse (any scipy.sparse ) sample
vectors as input. However, to use an SVM to make predictions for sparse data, it must have been fit on such data. For optimal performance, use
C-ordered numpy.ndarray (dense) or scipy.sparse.csr_matrix (sparse) with dtype=float64 .
1.4.1. Classification
SVC, NuSVC and LinearSVC are classes capable of performing binary and multi-class classification on a dataset.
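As a minimal sketch of the basic workflow, the snippet below fits an SVC on two toy points and predicts the class of a new point. The training data and the query point are illustrative assumptions, chosen so that both training points end up as support vectors.

```python
from sklearn import svm

# Two toy training samples, one per class (illustrative data).
X = [[0, 0], [1, 1]]
y = [0, 1]

# Fit a support vector classifier with default settings (RBF kernel).
clf = svm.SVC()
clf.fit(X, y)

# Predict the class of a new point that lies closer to [1, 1].
prediction = clf.predict([[2.0, 2.0]])
print(prediction)
```

After fitting, attributes such as support_vectors_ , support_ and n_support_ describe the support vectors found by the solver.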
>>> # get support vectors
>>> clf.support_vectors_
array([[0., 0.],
       [1., 1.]])
>>> # get indices of support vectors
>>> clf.support_
array([0, 1]...)
>>> # get number of support vectors for each class
>>> clf.n_support_
array([1, 1]...)
Examples:
SVM: Maximum margin separating hyperplane
Non-linear SVM
SVM-Anova: SVM with univariate feature selection
1.4.1.1. Multi-class classification
SVC and NuSVC implement the “one-versus-one” approach for multi-class classification. In total, n_classes * (n_classes - 1) / 2 classifiers are constructed and each
one is trained on data from two classes. To provide a consistent interface with other classifiers, the decision_function_shape option allows monotonically transforming the
results of the “one-versus-one” classifiers to a “one-vs-rest” decision function of shape (n_samples, n_classes) .
>>> from sklearn import svm
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC(decision_function_shape='ovo')
>>> clf.fit(X, Y)
SVC(decision_function_shape='ovo')
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6
>>> clf.decision_function_shape = "ovr"
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes
4
On the other hand, LinearSVC implements the “one-vs-the-rest” multi-class strategy, thus training n_classes models.
>>> lin_clf = svm.LinearSVC(dual="auto")
>>> lin_clf.fit(X, Y)
LinearSVC(dual='auto')
SVC (but not NuSVC) implements the parameter class_weight in the fit method. It’s a dictionary of the form {class_label : value} , where value is a floating point
number > 0 that sets the parameter C of class class_label to C * value . The figure below illustrates the decision boundary of an unbalanced problem, with and
without weight correction.
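The effect of class weights can be sketched as follows. In the scikit-learn API, class_weight is passed to the estimator constructor and applied during fit. The imbalanced toy data and the weight of 10 for the minority class are assumptions made for illustration only.

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)

# Imbalanced toy data (assumed): 100 majority samples, 10 minority samples.
X = np.r_[1.5 * rng.randn(100, 2), 0.5 * rng.randn(10, 2) + [2, 2]]
y = np.array([0] * 100 + [1] * 10)

# Same model with and without a class weight on the minority class.
plain = svm.SVC(kernel="linear", C=1.0).fit(X, y)
weighted = svm.SVC(kernel="linear", C=1.0, class_weight={1: 10}).fit(X, y)

# Count how many training samples each model assigns to the minority class.
n_minority_plain = int((plain.predict(X) == 1).sum())
n_minority_weighted = int((weighted.predict(X) == 1).sum())
print(n_minority_plain, n_minority_weighted)
```

Here the effective penalty for class 1 is C * 10, so errors on the minority class cost more than errors on the majority class.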
SVC, NuSVC, SVR, NuSVR, LinearSVC, LinearSVR and OneClassSVM also implement weights for individual samples in the fit method through
the sample_weight parameter. Similar to class_weight , this sets the parameter C for the i-th example to C * sample_weight[i] , which will encourage the classifier
to get these samples right. The figure below illustrates the effect of sample weighting on the decision boundary. The size of the circles is proportional to the sample
weights.
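Per-sample weights are passed directly to fit . The sketch below uses made-up data and arbitrarily up-weights the last five samples; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)

# Toy two-class data (assumed): 10 samples around (1, 1), 10 around the origin.
X = np.r_[rng.randn(10, 2) + [1, 1], rng.randn(10, 2)]
y = np.array([1] * 10 + [-1] * 10)

# Hypothetical weights: the last five samples count five times as much.
sample_weight = np.ones(len(X))
sample_weight[15:] *= 5

# The weights are given to fit(), not to the constructor.
clf = svm.SVC(kernel="rbf", gamma=1.0)
clf.fit(X, y, sample_weight=sample_weight)
predictions = clf.predict(X)
```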
Support Vector Regression (SVR) using linear and non-linear kernels
1.4.3. Density estimation, novelty detection
The class OneClassSVM implements a One-Class SVM which is used in outlier detection.
See Novelty and Outlier Detection for the description and usage of OneClassSVM.
1.4.4. Complexity
Support Vector Machines are powerful tools, but their compute and storage requirements increase rapidly with the number of training vectors. The core of an SVM is a
quadratic programming problem (QP), separating support vectors from the rest of the training data. The QP solver used by the libsvm-based implementation scales
between $O(n_{features} \times n_{samples}^2)$ and $O(n_{features} \times n_{samples}^3)$ depending on how efficiently the libsvm cache is used in practice (dataset dependent). If the data is very sparse, $n_{features}$ should be replaced by the average
number of non-zero features in a sample vector.
For the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more efficient than its libsvm-based SVC counterpart and can scale almost
linearly to millions of samples and/or features.
1.4.5. Tips on Practical Use
Avoiding data copy: For SVC, SVR, NuSVC and NuSVR, if the data passed to certain methods is not C-ordered contiguous and double precision, it will be copied
before calling the underlying C implementation. You can check whether a given numpy array is C-contiguous by inspecting its flags attribute.
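A short sketch of that check: the array below is deliberately created in Fortran order and single precision (an assumption for the demo), then converted once up front so that no further copy is needed later.

```python
import numpy as np

# Deliberately awkward input: Fortran-ordered, float32.
X = np.asfortranarray(np.random.RandomState(0).rand(5, 3).astype(np.float32))

# Inspect memory layout and precision.
print(X.flags["C_CONTIGUOUS"], X.dtype)

# Convert once to C-ordered float64 before handing the data to the estimator.
X_ready = np.ascontiguousarray(X, dtype=np.float64)
print(X_ready.flags["C_CONTIGUOUS"], X_ready.dtype)
```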
For LinearSVC (and LogisticRegression) any input passed as a numpy array will be copied and converted to the liblinear internal sparse data representation
(double precision floats and int32 indices of non-zero components). If you want to fit a large-scale linear classifier without copying a dense numpy C-contiguous
double precision array as input, we suggest using the SGDClassifier class instead. The objective function can be configured to be almost the same as
the LinearSVC model.
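One possible configuration is sketched below. The hinge loss with an L2 penalty gives a linear SVM objective; the specific values of alpha , max_iter and tol , as well as the toy data, are assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

# Dense, C-contiguous, double precision toy data (assumed).
X = np.ascontiguousarray(rng.randn(200, 5), dtype=np.float64)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hinge loss + L2 penalty: a linear SVM trained by stochastic gradient descent.
sgd = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4,
                    max_iter=1000, tol=1e-3, random_state=0)
sgd.fit(X, y)
train_accuracy = sgd.score(X, y)
print(train_accuracy)
```

Note that SGDClassifier is regularized through alpha rather than C , so the two models are not tuned with the same parameter.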
Kernel cache size: For SVC, SVR, NuSVC and NuSVR, the size of the kernel cache has a strong impact on run times for larger problems. If you have enough RAM
available, it is recommended to set cache_size to a higher value than the default of 200 (MB), such as 500 (MB) or 1000 (MB).
Setting C: C is 1 by default and it’s a reasonable default choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more
regularization.
LinearSVC and LinearSVR are less sensitive to C when it becomes large, and prediction results stop improving after a certain threshold. Meanwhile,
larger C values will take more time to train, sometimes up to 10 times longer, as shown in [11].
Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data. For example, scale each attribute on the input vector
Different kernels are specified by the kernel parameter:
>>> linear_svc = svm.SVC(kernel='linear')
>>> linear_svc.kernel
'linear'
>>> rbf_svc = svm.SVC(kernel='rbf')
>>> rbf_svc.kernel
'rbf'
See also Kernel Approximation for a way to use RBF kernels that is much faster and more scalable.
1.4.6.1. Parameters of the RBF Kernel
When training an SVM with the Radial Basis Function (RBF) kernel, two parameters must be considered: C and gamma . The parameter C , common to all SVM kernels,
trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying
all training examples correctly. gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.
Proper choice of C and gamma is critical to the SVM’s performance. One is advised to use GridSearchCV with C and gamma spaced exponentially far apart to choose
good values.
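A minimal sketch of such a search is shown here. The iris dataset, the grid bounds and the five-fold setting are assumptions chosen to keep the example small; real problems usually need a grid adapted to the data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Exponentially spaced candidate values for C and gamma.
param_grid = {
    "C": np.logspace(-2, 3, 6),      # 0.01 ... 1000
    "gamma": np.logspace(-4, 1, 6),  # 0.0001 ... 10
}

# Cross-validated search over all (C, gamma) pairs.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```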
Examples:
RBF SVM parameters
Non-linear SVM
1.4.6.2. Custom Kernels
You can define your own kernels by either giving the kernel as a Python function or by precomputing the Gram matrix.
Classifiers with custom kernels behave the same way as any other classifiers, except that:
Field support_vectors_ is now empty; only indices of support vectors are stored in support_ .
A reference (and not a copy) of the first argument in the fit() method is stored for future reference. If that array changes between the use
of fit() and predict() you will have unexpected results.
Using Python functions as kernels
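A minimal sketch of a kernel given as a Python function follows. The function takes two arrays of shape (n_samples_1, n_features) and (n_samples_2, n_features) and returns a kernel matrix of shape (n_samples_1, n_samples_2). The name my_kernel and the toy data are illustrative assumptions; the kernel itself is just a linear kernel.

```python
import numpy as np
from sklearn import svm


def my_kernel(X, Y):
    """Linear kernel: returns the matrix of pairwise dot products."""
    return np.dot(X, Y.T)


# Linearly separable toy data (assumed).
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])

# Pass the callable as the kernel argument.
clf = svm.SVC(kernel=my_kernel)
clf.fit(X, y)
pred = clf.predict(np.array([[0.5, 0.5], [2.5, 2.5]]))
print(pred)
```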
Using the Gram matrix
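A sketch of the precomputed variant: with kernel='precomputed' , fit receives the Gram matrix between training samples, and predict receives the kernel values between test and training samples. The synthetic dataset and the use of a plain dot product as the kernel are assumptions for illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = svm.SVC(kernel="precomputed")

# Linear kernel computation between all pairs of training samples.
gram_train = np.dot(X_train, X_train.T)
clf.fit(gram_train, y_train)

# At prediction time, the kernel is computed between test and training samples.
gram_test = np.dot(X_test, X_train.T)
pred = clf.predict(gram_test)
print(pred)
```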
Intuitively, we’re trying to maximize the margin (by minimizing $\|w\|^2 = w^T w$), while incurring a penalty when a sample is misclassified or within the margin boundary. Ideally, the
value $y_i (w^T \phi(x_i) + b)$ would be $\geq 1$ for all samples, which indicates a perfect prediction. But problems are usually not perfectly separable with a hyperplane, so we allow some
samples to be at a distance $\zeta_i$ from their correct margin boundary. The penalty term C controls the strength of this penalty, and as a result, acts as an inverse regularization
parameter (see note below).
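The dual problem to the primal is

$$\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad y^T \alpha = 0, \quad 0 \leq \alpha_i \leq C, \; i = 1, \ldots, n$$

where $e$ is the vector of all ones, and $Q$ is an $n$ by $n$ positive semidefinite matrix, $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel. The terms $\alpha_i$ are called the dual coefficients, and they are
upper-bounded by $C$. This dual representation highlights the fact that training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$:
see kernel trick.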
Once the optimization problem is solved, the output of decision_function for a given sample $x$ becomes:

$$\sum_{i \in SV} y_i \alpha_i K(x_i, x) + b,$$

and the predicted class corresponds to its sign. We only need to sum over the support vectors (i.e. the samples that lie within the margin) because the dual coefficients are
zero for the other samples.
These parameters can be accessed through the attributes dual_coef_ which holds the product $y_i \alpha_i$, support_vectors_ which holds the support vectors,
and intercept_ which holds the independent term $b$.
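To make the roles of these attributes concrete, the sketch below recomputes the decision function of a binary RBF-kernel SVC by hand from dual_coef_ , support_vectors_ and intercept_ , and compares it with decision_function . The synthetic data and the fixed gamma value are assumptions; gamma is set explicitly so that the same value can be reused in the manual kernel computation.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel

# Binary toy problem (assumed data).
X, y = make_classification(n_samples=40, n_features=4, random_state=0)

gamma = 0.5
clf = svm.SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Kernel values between every support vector and every sample: (n_SV, n_samples).
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)

# Weighted sum over support vectors plus the independent term.
manual = (clf.dual_coef_ @ K).ravel() + clf.intercept_

matches = np.allclose(manual, clf.decision_function(X))
print(matches)
```

Only the support vectors enter the sum, which is why the fitted model does not need to keep the rest of the training set.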
Note: While SVM models derived from libsvm and liblinear use C as regularization parameter, most other estimators use alpha . The exact equivalence between the
amount of regularization of two models depends on the exact objective function optimized by the model. For example, when the estimator used is Ridge regression,
the relation between them is given as $C = \frac{1}{\text{alpha}}$.
LinearSVC
NuSVC
1.4.7.2. SVR