Data Mining Algorithms In R/Dimensionality Reduction/Singular Value Decomposition


In this chapter we will take a look at Singular Value Decomposition (SVD), a matrix factorization method from Linear Algebra.

Singular Value Decomposition

In data mining, this algorithm can be used to better understand a database by showing how many important dimensions it has, and also to simplify it by reducing the number of attributes used in the mining process. This reduction removes redundant data that are linearly dependent from the point of view of Linear Algebra. For example, imagine a database with one field that stores the water temperature of several samples and another that stores its state (solid, liquid or gas). It is easy to see that the second field depends on the first, so SVD could easily show us that it is not important for the analysis.
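To see the idea on a toy example (the numbers below are made up, and an intercept column is included so that the Fahrenheit column is an exact linear combination of the others), a redundant attribute shows up as a near-zero singular value:

temp_c <- c(10, 25, 40, 60, 80)           # hypothetical water temperatures (Celsius)
temp_f <- temp_c * 9/5 + 32               # the same information in Fahrenheit
other  <- c(0.3, -1.2, 0.5, 0.9, -0.7)    # an unrelated attribute
X <- cbind(intercept = 1, temp_c, temp_f, other)
round(svd(X)$d, 8)   # the last singular value is ~0: one column is redundant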

Principal Component Analysis (PCA) can be seen as a specific application of SVD: applying SVD to a mean-centered data matrix yields the principal components.
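One can check this relationship in R (a small sketch with random data; prcomp() is R's standard PCA function):

set.seed(42)
X  <- matrix(rnorm(40), ncol = 4)
Xc <- scale(X, center = TRUE, scale = FALSE)   # mean-center the columns
p  <- prcomp(Xc, center = FALSE)
s  <- svd(Xc)
all.equal(p$sdev, s$d / sqrt(nrow(Xc) - 1))    # TRUE: the PCA standard deviations
                                               # are rescaled singular values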

Algorithm

SVD is the factorization of an m × n matrix X of real or complex numbers into:

X = U A V^T

where U is a matrix whose dimensions are m × m, V^T is a matrix whose dimensions are n × n, and A is a matrix whose dimensions are m × n, the same dimensions as X.

Besides:

U^T U = I_m

and

V^T V = I_n

where I_m and I_n are identity matrices whose sizes are, respectively, m and n.

The columns of U are the left singular vectors of the matrix X, and the columns of V (or the rows of V^T) are the right singular vectors.

The matrix A is a diagonal matrix whose diagonal values are the singular values of the matrix X. The singular value in a row of A is never less than the value in the row below, i.e. the singular values appear in non-increasing order. All singular values are greater than or equal to 0.

To compute the SVD is to find the eigenvalues and the eigenvectors of X X^T and X^T X. The eigenvectors of X X^T are the columns of U and the eigenvectors of X^T X are the columns of V. The singular values of X, in the diagonal of matrix A, are the square roots of the common positive eigenvalues of X X^T and X^T X.

If X X^T and X^T X have the same number of eigenvalues, X is a square matrix; otherwise, the eigenvalues of the smaller of the two matrices are also eigenvalues of the larger one. Therefore, the singular values of X come from whichever of X X^T and X^T X has fewer eigenvalues.

The number of nonzero singular values of a matrix is the rank of that matrix, that is, the number of linearly independent columns or rows of the matrix. The rank is not greater than min(m, n), because this is the number of elements on the diagonal of A, and the singular values are exactly those diagonal elements. In other words, the number of positive singular values equals the rank of the matrix.
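This can be verified quickly in R (a small sketch with a deliberately rank-deficient matrix):

X <- cbind(c(1, 2, 3), c(2, 4, 6), c(0, 1, 0))   # second column = 2 * first
svd(X)$d     # only two values are clearly above zero
qr(X)$rank   # 2: the rank equals the number of positive singular values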

Therefore, the algorithm is (an R sketch of these steps follows the list):

  1. Compute X X^T.
  2. Compute the eigenvalues and the eigenvectors of X X^T.
  3. Compute X^T X.
  4. Compute the eigenvalues and the eigenvectors of X^T X.
  5. Compute the square roots of the common positive eigenvalues of X X^T and X^T X.
  6. Finally, assign the computed values to U, V and A.
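The sketch below follows these steps, using R's eigen() on X^T X; the function name svd_via_eigen is ours, and the code assumes a real matrix with full column rank. It is not numerically robust; in practice one should use the built-in svd() described below.

# A sketch of the algorithm above, assuming X is a real matrix with
# more rows than columns and full column rank (no zero singular values).
svd_via_eigen <- function(X) {
    e <- eigen(t(X) %*% X)         # eigen-decomposition of X^T X
    d <- sqrt(pmax(e$values, 0))   # singular values (clip tiny negative round-off)
    V <- e$vectors                 # right singular vectors
    U <- X %*% V %*% diag(1 / d)   # left singular vectors, from X = U A V^T
    list(d = d, u = U, v = V)
}

X <- matrix(c(4, 0, 3, -5, 0, 2), nrow = 3, byrow = TRUE)
svd_via_eigen(X)$d   # matches svd(X)$d (vectors agree up to sign conventions)

Note that the left singular vectors are recovered here from the factorization itself rather than from a second eigendecomposition of X X^T, which is equivalent when X has full column rank.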

Example

X =

    1  1  0  2
    1  1  2  0
    2  0  1  1

X X^T =

    6  2  4
    2  6  4
    4  4  6

The eigenvalues of X X^T are:

12.744563, 4.000000, and 1.255437

The eigenvectors of X X^T are:

  1. 0.5417743, 0.5417743, 0.6426206
  2. 0.7071068, -0.7071068, -2.144010 × 10^-16
  3. 0.4544013, 0.4544013, -0.7661846

X^T X =

    6  2  4  4
    2  2  2  2
    4  2  5  1
    4  2  1  5

The eigenvalues of X^T X are:

12.744563, 4.000000, 1.255437, and 5.940557 × 10^-18 (numerically zero)

The eigenvectors of X^T X are:

  1. 0.6635353, 0.3035190, 0.4835272, 0.4835272
  2. 0.0000000, 0.5181041 × 10^-16, -0.7071068, 0.7071068
  3. -0.5565257, 0.8110957, 0.1272850, 0.1272850
  4. 0.5000000, 0.5000000, -0.5000000, -0.5000000

The singular values of X are the square roots of the common positive eigenvalues of X^T X and X X^T:

sqrt(12.744563) ≈ 3.569953,  sqrt(4.000000) = 2.000000,  sqrt(1.255437) ≈ 1.120463

Therefore:

A =

    3.569953  0.000000  0.000000  0
    0.000000  2.000000  0.000000  0
    0.000000  0.000000  1.120463  0

(The fourth column of zeros gives A the same 3 × 4 dimensions as X.)

Finally, X is decomposed as X = U A V^T, with

U =

    0.5417743   0.7071068   0.4544013
    0.5417743  -0.7071068   0.4544013
    0.6426206   0.0000000  -0.7661846

V^T =

     0.6635353  0.3035190   0.4835272   0.4835272
     0.0000000  0.0000000  -0.7071068   0.7071068
    -0.5565257  0.8110957   0.1272850   0.1272850
     0.5000000  0.5000000  -0.5000000  -0.5000000

and A as given above; multiplying U A V^T reproduces X.
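This worked example can be checked directly in R (a quick verification sketch, entering the matrix X above by hand):

X <- matrix(c(1, 1, 0, 2,
              1, 1, 2, 0,
              2, 0, 1, 1), nrow = 3, byrow = TRUE)
eigen(X %*% t(X))$values   # 12.744563  4.000000  1.255437
eigen(t(X) %*% X)$values   # the same values, plus one that is numerically zero
svd(X)$d                   # 3.569953  2.000000  1.120463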

Implementation

R has a built-in function that computes the SVD, called svd().

By default, it receives a native R matrix as argument and returns a list that contains the singular values and the matrices U and V.

Arguments

If it is not necessary to compute all singular values and vectors, one can tell svd() the exact number of elements needed.

This is achieved through the arguments nu and nv, which give the number of left and right singular vectors to compute, respectively.

For example, in order to calculate only half of these vectors, one could do:

k <- min(nrow(X), ncol(X)) %/% 2   # integer division: nu and nv must be whole numbers
svd(X, nu = k, nv = k)
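At the extreme, setting both to zero requests no singular vectors at all, so the returned list should contain only the singular values (a quick sketch):

X <- matrix(rnorm(12), nrow = 3)
s <- svd(X, nu = 0, nv = 0)   # singular values only, no U or V
names(s)                      # only "d" is present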

Returned object

s = svd(X)

Considering:

X = U A V^T

The returned object is a list that contains three fields:

s$d is the vector that contains the singular values of X, taken from the diagonal of the matrix A.

s$u is the matrix whose columns contain the left singular vectors of X. Its number of rows equals the number of rows of X and its number of columns is the number passed to the parameter nu. Note that if nu is 0, this matrix is not created.

s$v is the matrix whose columns contain the right singular vectors of X. Its number of rows equals the number of columns of X and its number of columns is the number passed to the parameter nv. Note that if nv is 0, this matrix is not created.
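For instance, these fields can be inspected directly (a small sketch):

X <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)   # a 3 x 2 example matrix
s <- svd(X)
s$d                                          # two singular values
dim(s$u); dim(s$v)                           # 3 x 2 and 2 x 2
all.equal(X, s$u %*% diag(s$d) %*% t(s$v))   # TRUE: U A V^T reconstructs X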


Examples

Execute the following command sequences in the R terminal or as an R program and see the results:

Example 1:

dat <- seq(1, 240, 2)        # 120 values: 1, 3, 5, ..., 239
X <- matrix(dat, ncol = 12)  # a 10 x 12 matrix
s <- svd(X)
A <- diag(s$d)
s$u %*% A %*% t(s$v)         # X = U A V'

Example 2:

dat <- seq(1, 240, 2)
X <- matrix(dat, ncol = 12)
s <- svd(X, nu = nrow(X), nv = ncol(X))
A <- diag(s$d)
A <- cbind(A, 0)      # add two zero columns so that A has the same dimensions as X
A <- cbind(A, 0)
s$u %*% A %*% t(s$v)  # X = U A V'

Case Study

Scenario

In order to better visualize the results of SVD, in this case study we will work with a matrix that represents an image, so that any change to the matrix can be easily observed.

Input Data

To work with an image on R, one should install the package rimage:

> install.packages("rimage")

Let's then load the image into R, converting it to greyscale. This way, we will end up with a matrix of intensities between 0 and 1, where 1 means white and 0 means black.

library(rimage)
tux <- read.jpeg('tux.jpeg')
tux <- imagematrix(tux,type='grey')

We can see the result of this import:

plot(tux)

In order to help us with this dimension reduction, let's write a little helper function, which will receive our tux matrix and the number of dimensions we want, and return our new tux.

reduce <- function(A, dim) {
    # Calculate the SVD
    sing <- svd(A)

    # Keep only the first 'dim' singular vectors and values
    u <- as.matrix(sing$u[, 1:dim])
    v <- as.matrix(sing$v[, 1:dim])
    d <- as.matrix(diag(sing$d)[1:dim, 1:dim])

    # Create the new approximated matrix
    return(imagematrix(u %*% d %*% t(v), type = 'grey'))
}

Execution and output

Now that we have our matrix, let's see how many singular values it has:

tux_d <- svd(tux)
length(tux_d$d)
[1] 335

So we have 335 singular values on this matrix. Let's first try to reduce it to only one singular value:

plot(reduce(tux,1))

As we can see, this approximation removes a lot of useful information from our matrix. If it were a database, we would surely have lost important data.

Let's try with roughly 10% (35) of the singular values:

plot(reduce(tux,35))

Analysis

With only 10% of the original data we are able to create a very good approximation of it. Moreover, with this method we can remove noise and linearly dependent elements by using only the most important singular values. This is very useful in data mining, since it is hard to tell whether a database is clean and, if not, which elements are useful for our analysis.
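A common heuristic for choosing how many singular values to keep is the retained "energy", the cumulative share of the squared singular values (a sketch, reusing tux_d from above):

energy <- cumsum(tux_d$d^2) / sum(tux_d$d^2)
k <- which(energy >= 0.99)[1]   # smallest number of values keeping 99% of the energy
plot(reduce(tux, k))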
