Data Mining Algorithms In R/Classification/penalizedSVM


Introduction

Classifiers are among the most common data analysis tools. Many techniques have been implemented, but the SVM (Support Vector Machine) stands out as one of the most powerful, especially for high-dimensional data. The best-known SVM algorithm was created by Vladimir Vapnik. [1]

The standard SVM implementation takes an input dataset and, for each given input, predicts which of two possible classes it belongs to. The most common use of the algorithm is to predict whether an input belongs to a certain dichotomy or not. Because of this characteristic, SVM is called a non-probabilistic binary linear classifier.

In machine-learning-based algorithms such as SVM, the input data has to be separated into two sets: a training set and a test set. The difference between them is that in the training set the examples' classes are known beforehand, while the test set contains the examples whose classes should be predicted. Given a set of training examples, an SVM algorithm builds a model that predicts the categories of the test set's examples.

Representing the examples in a d-dimensional space, the model built by the algorithm is a hyperplane that separates the examples belonging to the dichotomy from those that do not. New examples are then mapped into that space, and the categories they belong to are predicted based on the side of the gap each one falls on.
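
As a small illustration, the decision rule above can be sketched in a few lines of R. The weight vector, intercept, and test point below are made-up values, not the output of any fitted model:

```r
# Made-up 2-dimensional example of the SVM decision rule f(x) = w.x + b.
w <- c(0.8, -0.5)          # hypothetical coefficients of the hyperplane
b <- 0.1                   # hypothetical intercept
x.test <- c(1.2, 0.3)      # a new example to classify
f <- sum(w * x.test) + b   # f(x.test) = 0.91
sign(f)                    # 1, so x.test falls on the dichotomy's side
```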

The technique described here is a variation of the standard SVM that uses penalty functions. It is implemented in the R package penalizedSVM, which offers smoothly clipped absolute deviation (SCAD), 'L1-norm', 'Elastic Net' ('L1-norm' and 'L2-norm') and 'Elastic SCAD' (SCAD and 'L2-norm') as available penalties.

Technique/Algorithm

Algorithm

Given a training dataset ${\displaystyle (x_{1},y_{1}),...,(x_{n},y_{n})}$, where ${\displaystyle x_{i}}$ is a d-tuple of the d input parameters of example i, and ${\displaystyle y_{i}\in \{-1,1\}}$, with ${\displaystyle y_{i}=1}$ meaning that i belongs to the dichotomy and ${\displaystyle y_{i}=-1}$ the opposite, the SVM divides the space by a linear boundary:

${\displaystyle f(x)=\sum _{j=1}^{d}w_{j}x_{j}+b}$

where ${\displaystyle w=(w_{1},...,w_{d})}$ are the coefficients of the hyperplane and ${\displaystyle b}$ denotes its intercept. The output for an example ${\displaystyle x_{test}}$ from the test set is ${\displaystyle y_{test}=sign[f(x_{test})]}$; that is, the example belongs to the dichotomy if ${\displaystyle f(x_{test})>0}$.

Finding the best model to classify new examples is, in fact, equivalent to the problem of finding the optimal hyperplane. The quality of a model built by the algorithm is measured by the margin of its hyperplane: the distance between the hyperplanes wx+b = 1 and wx+b = -1, positioned so that the points closest to wx+b = 0 on each side are crossed by those two hyperplanes. The best hyperplanes are those with the largest margins.
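
The margin can be computed directly from the coefficient vector: the hyperplanes wx+b = 1 and wx+b = -1 each lie at distance 1/||w|| from wx+b = 0, so the margin equals 2/||w||. A minimal check in R, with a made-up w:

```r
# The distance between w.x + b = 1 and w.x + b = -1 is 2 / ||w||;
# w here is an illustrative coefficient vector only.
w <- c(0.8, -0.5)
margin <- 2 / sqrt(sum(w^2))  # 2 / Euclidean norm of w
round(margin, 2)              # 2.12
```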

The problem of finding the optimal hyperplane with maximal margin is solved by convex optimization. Maximizing the margin can be achieved by solving:

${\displaystyle \min _{b,w}\sum \left[1-y_{i}f\left(x_{i}\right)\right]_{+}+\operatorname {pen} _{\lambda }\left(w\right)}$

The term

${\displaystyle pen_{\lambda }(w)=\lambda {\begin{Vmatrix}w\end{Vmatrix}}_{2}^{2}}$

for the standard SVM implementation has the form of the L2 norm ('ridge penalty'). This penalty shrinks the coefficients, but does not set them exactly to zero.

The quality of a model can also be measured by the error computed for the predictions on the training set. A low prediction error is required, but reducing it to zero may cause a problem known as overfitting: the model classifies the examples in the training set very well, but is not appropriate for classifying those in the test set, so its predictions for new examples would usually be wrong.

A low prediction error is not the only factor that defines a model's quality. A model can be improved further by identifying the covariates that play important roles in discrimination and assessing their contribution to the classifier. This can be achieved by applying a feature selection method.

Feature selection methods are divided into two classes: filter and wrapper methods [2]. Filter methods drop irrelevant features before the model is built by the algorithm. Wrapper methods increase the prediction power by performing the selection within the optimization procedure.

The R package 'penalizedSVM' provides several feature selection methods. One of them is the wrapper feature selection SCAD (Smoothly Clipped Absolute Deviation). SCAD is a non-convex penalty function first proposed by Fan [3] and discussed in [4]. In [5], SVM is combined with SCAD for feature selection. The penalization term for SCAD SVM has the form:

${\displaystyle pen_{\lambda }(w)=\textstyle \sum _{j=1}^{d}p_{\lambda }(w_{j})}$

where the SCAD penalty function for each coefficient wj is defined as:

${\displaystyle p_{\lambda }(w_{j})={\begin{cases}\lambda \left\vert w_{j}\right\vert &{\text{if }}\left\vert w_{j}\right\vert \leq \lambda \\-{\frac {\left\vert w_{j}\right\vert ^{2}-2a\lambda \left\vert w_{j}\right\vert +\lambda ^{2}}{2(a-1)}}&{\text{if }}\lambda <\left\vert w_{j}\right\vert \leq a\lambda \\{\frac {(a+1)\lambda ^{2}}{2}}&{\text{if }}\left\vert w_{j}\right\vert >a\lambda \end{cases}}}$

with tuning parameters a > 2 (in the package, a = 3.7) and λ > 0. pλ (w) corresponds to a quadratic spline function with knots at λ and aλ.

For small coefficients, SCAD behaves like the L1 penalty. For large coefficients, however, SCAD applies a constant penalty, in contrast to the L1 penalty, which increases linearly with the coefficient. This absolute maximum of the SCAD penalty, which is independent of the input data, reduces the possible bias when estimating large coefficients.
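
The piecewise definition above translates directly into R. The following sketch (not part of the penalizedSVM package) evaluates the penalty in its three regions, using the package's default a = 3.7:

```r
# Direct transcription of the SCAD penalty p_lambda defined above,
# with the package's default a = 3.7; illustrative code only.
scad.penalty <- function(w, lambda = 1, a = 3.7) {
  aw <- abs(w)
  ifelse(aw <= lambda,
         lambda * aw,                                # L1-like linear region
         ifelse(aw <= a * lambda,
                -(aw^2 - 2 * a * lambda * aw + lambda^2) /
                  (2 * (a - 1)),                     # quadratic region
                (a + 1) * lambda^2 / 2))             # constant region
}
# one value per region: 0.5 (linear), about 1.81 (quadratic), 2.35 (constant)
scad.penalty(c(0.5, 2, 10))
```

Note that the three pieces join continuously at the knots λ and aλ, which is what makes pλ a quadratic spline.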

Implementation

The package described in this section is penalizedSVM. It provides feature selection SVM using penalty functions. The smoothly clipped absolute deviation (SCAD), 'L1-norm', 'Elastic Net' ('L1-norm' and 'L2-norm') and 'Elastic SCAD' (SCAD and 'L2-norm') penalties are available. The tuning parameters can be found using either a fixed grid or an interval search.

This package has several dependencies. The packages that need to be installed and their descriptions are listed below:

• e1071: functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, ...

• MASS: functions and datasets to support Venables and Ripley, 'Modern Applied Statistics with S' (4th edition).

• corpcor: implements a James-Stein-type shrinkage estimator for the covariance matrix, with separate shrinkage for variances and correlations. The details of the method are explained in Schäfer and Strimmer (2005) and Opgen-Rhein and Strimmer (2007).

• statmod: various statistical modeling functions including growth curve comparisons, limiting dilution analysis, mixed linear models, heteroscedastic regression, Tweedie family generalized linear models, the inverse-Gaussian distribution and Gauss quadrature.

• tgp: Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian processes with jumps to the limiting linear model (LLM).

• mlegp: maximum likelihood Gaussian process modeling for univariate and multi-dimensional outputs with diagnostic plots.

• lhs: provides a number of methods for creating and augmenting Latin Hypercube Samples.

The major function in penalizedSVM is svm.fs. Its usage can be described as follows:

## Default S3 method:
svm.fs(
x,
y,
fs.method = c("scad", "1norm", "scad+L2", "DrHSVM"),
### tuning parameter settings
# choose the search method for tuning lambda1,2: 'interval' or 'discrete'
grid.search=c("interval","discrete"),
# fixed grid for lambda1, lambda2
lambda1.set=NULL,
lambda2.set=NULL,
# define range for lambda1,2 for interval search
bounds=NULL,
# parms.coding="none" or "log2"
parms.coding= c("log2","none"),
# internal parameter for DIRECT
maxevals=500,
### validation settings
# for nested validation, 'cross.outer'-fold cv
#cross.outer= 0,
# method for the inner validation: cross validation, gacv
inner.val.method = c("cv", "gacv"),
# 'cross.inner'-fold cv
cross.inner= 5,
# show plots in DIRECT?
show= c("none", "final"),
### other settings
# internal parameter for svm
calc.class.weights=FALSE,
class.weights=NULL,
# seed
seed=123,
# max iterations for the feature selection svm method
maxIter=700,
# verbose?
verbose=TRUE,
...)


where the arguments are:

• x: input matrix, with genes in columns and samples in rows.
• y: numerical vector of class labels, -1 , 1
• fs.method: feature selection method. Available: 'scad', '1norm' for 1-norm, "DrHSVM" for Elastic Net and "scad+L2" for Elastic SCAD
• grid.search: choose the search method for tuning lambda1,2: 'interval' or 'discrete', default: 'interval'
• lambda1.set: for fixed grid search: fixed grid for lambda1, default: NULL
• lambda2.set: for fixed grid search: fixed grid for lambda2, default: NULL
• bounds: bounds for the interval grid search, default: NULL
• parms.coding: for interval grid search: none or log2, default: log2
• maxevals: the maximum number of DIRECT function evaluations, default: 500.
• cross.outer: fold of outer cross validation, default is 0, no cv.
• calc.class.weights: calculate class.weights for SVM, default: FALSE
• class.weights: a named vector of weights for the different classes, used for asymmetric class sizes. Not all factor levels have to be supplied (default weight: 1). All components have to be named.

• inner.val.method: method for the inner validation: cross validation, gacv , default cv
• cross.inner: ’cross.inner’-fold cv, default: 5
• show: for interval search: show plots of DIRECT algorithm: none, final iteration, all iterations. Default: none
• seed: seed
• maxIter: maximal iteration, default: 700
• verbose: verbose?, default: TRUE
• ...: additional argument(s)

View

To visualize the result of the algorithm, you can use the show argument, as shown in the argument list above.

Case Study

In this section, we illustrate a case study with penalizedSVM:

Scenario

The objective is to tell whether a picture belongs to the same category as another one selected previously. For that purpose, we use the difference between the values of their characteristics: each attribute of the input data represents the difference between the value of a characteristic in the selected picture and in the one we want to classify.

Dataset

The training set and the test set were generated using the commands shown below:

> train<-sim.data(n = 200, ng = 100, nsg = 10, corr=FALSE, seed=seed )
> print(str(train))
List of 3
 $ x   : num [1:100, 1:200] -0.5605 2.1988 -0.0736 1.074 0.3563 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:100] "pos1" "pos2" "pos3" "pos4" ...
  .. ..$ : chr [1:200] "1" "2" "3" "4" ...
 $ y   : Named num [1:200] 1 -1 1 1 1 -1 -1 -1 1 -1 ...
  ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
 $ seed: num 123
NULL
> test<-sim.data(n = 20, ng = 100, nsg = 10, corr=FALSE, seed=seed+1 )
> print(str(test))
List of 3
 $ x   : num [1:100, 1:20] -1.3851 -1.1036 -0.2677 0.2836 -0.0951 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:100] "pos1" "pos2" "pos3" "pos4" ...
  .. ..$ : chr [1:20] "1" "2" "3" "4" ...
 $ y   : Named num [1:20] -1 1 -1 1 1 1 1 1 1 -1 ...
  ..- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
 $ seed: num 124
NULL


Execution

To build the model, the following commands were used:

> bounds=t(data.frame(log2lambda1=c(-10, 10)))
> colnames(bounds)<-c("lower", "upper")
# computation intensive; for demonstration reasons only for the first 100 features
# and only for 10 iterations (maxIter=10; default maxIter=700)
> system.time( scad<- svm.fs(t(train$x)[,1:100], y=train$y, fs.method="scad", bounds=bounds,
+ cross.outer= 0, grid.search = "interval", maxIter = 10,
+ inner.val.method = "cv", cross.inner= 5, maxevals=500,
+ seed=seed, parms.coding = "log2", show="none", verbose=FALSE ) )


Output

To see the model created:

> print(str(scad$model))
List of 11
 $ w       : Named num [1:23] 0.625 0.616 0.353 0.258 0.959 ...
  ..- attr(*, "names")= chr [1:23] "pos1" "pos2" "pos3" "pos4" ...
 $ b       : num -0.115
 $ xind    : int [1:23] 1 2 3 4 5 6 7 8 9 10 ...
 $ index   : int [1:83] 3 4 9 14 17 18 22 35 37 40 ...
 $ fitted  : num [1:200] 2.6 1.24 0.65 1 1.15 ...
 $ type    : num 0
 $ lambda1 : num 0.126
 $ lambda2 : NULL
 $ iter    : num 10
 $ q.val   : num 0.195
 $ fit.info:List of 13
  ..$ fmin       : num 0.195
  ..$ xmin       : Named num -2.99
  .. ..- attr(*, "names")= chr "log2lambda1"
  ..$ iter       : num 26
  ..$ neval      : num 46
  ..$ maxevals   : num 500
  ..$ seed       : num 123
  ..$ bounds     : num [1, 1:2] -10 10
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr "log2lambda1"
  .. .. ..$ : chr [1:2] "lower" "upper"
  ..$ Q.func     : chr ".calc.scad"
  ..$ points.fmin:'data.frame': 1 obs. of 2 variables:
  .. ..$ log2lambda1: num -2.99
  .. ..$ f          : num 0.195
  ..$ Xtrain     : num [1:46, 1] -7.52 -2.26 -1.34 9.99 9.03 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "log2lambda1"
  ..$ Ytrain     : num [1:46] 3.65e-01 3.20e-01 4.60e-01 1.00e+16 1.00e+16 ...
  ..$ gp.seed    : num [1:25] 123 124 125 126 127 128 129 130 131 132 ...
  ..$ model.list :List of 1
  .. ..$ model:List of 10
  .. .. ..$ w               : Named num [1:23] 0.625 0.616 0.353 0.258 0.959 ...
  .. .. .. ..- attr(*, "names")= chr [1:23] "pos1" "pos2" "pos3" "pos4" ...
  .. .. ..$ b               : num -0.115
  .. .. ..$ xind            : int [1:23] 1 2 3 4 5 6 7 8 9 10 ...
  .. .. ..$ index           : int [1:83] 3 4 9 14 17 18 22 35 37 40 ...
  .. .. ..$ fitted          : num [1:200] 2.6 1.24 0.65 1 1.15 ...
  .. .. ..$ type            : num 0
  .. .. ..$ lambda1         : num 0.126
  .. .. ..$ iter            : num 10
  .. .. ..$ q.val           : num 0.195
  .. .. ..$ inner.val.method: chr "cv"
NULL


To predict a class for the examples on the test set:


> (scad.5cv.test<-predict.penSVM(scad, t(test$x)[,1:100], newdata.labels=test$y) )
$pred.class
 [1] -1  1 -1 -1 -1  1  1  1  1 -1 -1 -1  1  1  1 -1  1  1 -1 -1
Levels: -1 1

$fitted
         [,1]
1  -2.5344366
2   2.3440943
3  -1.3972349
4  -0.3613470
5  -2.1187284
6   1.1287477
7   2.5584662
8   1.9155333
9   1.5543941
10 -0.7128084
11 -1.6944994
12 -0.2943272
13  1.8497781
14  2.7800572
15  0.8927699
16 -0.1289518
17  2.4560094
18  0.8756835
19 -2.2114729
20 -1.7342811

$tab
          newdata.labels
pred.class -1 1
        -1  6 4
        1   1 9

$error
[1] 0.25

$sensitivity
[1] 0.6923077

$specificity
[1] 0.8571429

Analysis

To analyse the results, the following commands can be used:

> print(paste("minimal 5-fold cv error:", scad$model$fit.info$fmin,
+ "by log2(lambda1)=", scad$model$fit.info$xmin))
[1] "minimal 5-fold cv error: 0.195 by log2(lambda1)= -2.99093721912059"
> print(" all lambdas with the same minimum? ")
[1] " all lambdas with the same minimum? "
> print(scad$model$fit.info$points.fmin)
   log2lambda1     f
36   -2.990937 0.195
> print(paste(scad$model$fit.info$neval, "visited points"))
[1] "46 visited points"
# Plot the results
> .plot.EPSGO.parms(scad$model$fit.info$Xtrain, scad$model$fit.info$Ytrain,
+ bound=bounds, Ytrain.exclude=10^16, plot.name=NULL )
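
The error, sensitivity and specificity reported above can be checked by hand from the confusion matrix $tab. The following sketch (illustrative code, not part of the penalizedSVM package) recomputes them, with rows as predicted classes and columns as true labels:

```r
# Recomputing the performance measures from the $tab confusion matrix above.
tab <- matrix(c(6, 1, 4, 9), nrow = 2,
              dimnames = list(pred = c("-1", "1"), truth = c("-1", "1")))
error       <- (tab["-1", "1"] + tab["1", "-1"]) / sum(tab)  # misclassified / total
sensitivity <- tab["1", "1"]   / sum(tab[, "1"])             # TP / (TP + FN)
specificity <- tab["-1", "-1"] / sum(tab[, "-1"])            # TN / (TN + FP)
round(c(error, sensitivity, specificity), 7)
# 0.2500000 0.6923077 0.8571429, matching the predict.penSVM output
```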


References

1. Corinna Cortes and V. Vapnik, Support-Vector Networks. Machine Learning, 20, 1995. http://www.springerlink.com/content/k238jx04hm87j80g/
2. Blum A and Langley P, Selection of relevant features and examples in machine learning. Artif. Intell. 1997;97:245-271.
3. Fan J., Comments on 'Wavelets in Statistics: A Review' by A. Antoniadis. J. Italian Stat. Assoc. 1997;6:131-138.
4. Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001;96:1348-1360.
5. Zhang HH, et al., Gene selection using support vector machines with nonconvex penalty. Bioinformatics 2006;22:88-95.