R Programming/Factor Analysis

From Wikibooks, open books for an open world

Introduction

Factor analysis is a set of techniques for reducing the dimensionality of data. The goal is to describe the dataset with a smaller number of variables (i.e. underlying factors). Factor analysis was developed in the early part of the 20th century by L. L. Thurstone and others. Correspondence analysis was originally developed by Jean-Paul Benzécri in the 1960s and 1970s. Factor analysis is mainly used in marketing, sociology and psychology. These techniques are often grouped under the broader headings of multivariate data analysis, exploratory data analysis or data mining.

Three main methods are covered here. Principal Component Analysis (PCA) deals with continuous variables. Correspondence Analysis (CA) deals with a contingency table (two qualitative variables), and Multiple Correspondence Analysis (MCA) generalizes correspondence analysis to more than two qualitative variables. The major difference between Factor Analysis (FA) and PCA is that FA analyses only the variance that is common to multiple variables, while PCA analyses all of the variance. Factor analysis is a difficult procedure to use properly and is often misapplied in the psychological literature. One of the major issues in FA (and PCA) is deciding how many factors to extract from the data. Extracting too many or too few factors can cause difficulties with the interpretation and analysis of the data.
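
The difference between common and total variance can be illustrated with a small simulation. The following is only a sketch using base R: the data set, the seed and all variable names are made up for illustration, and six variables are used because factanal() cannot fit two factors to only four variables.

```r
set.seed(10)
N <- 1000
f1 <- rnorm(N)
f2 <- rnorm(N)

# Hypothetical data: each variable is one latent factor plus independent noise
dat <- data.frame(
  x1 = f1 + rnorm(N), x2 = f1 + rnorm(N), x3 = f1 + rnorm(N),
  x4 = f2 + rnorm(N), x5 = f2 + rnorm(N), x6 = f2 + rnorm(N)
)

# PCA on standardised variables analyses all of the variance:
# the component variances add up to the number of variables
pca <- prcomp(dat, scale. = TRUE)
total_variance <- sum(pca$sdev^2)
total_variance

# FA splits each variable's variance into a common part (communality)
# and a part unique to that variable (uniqueness)
fa <- factanal(dat, factors = 2)
communality <- 1 - fa$uniquenesses
round(cbind(communality, uniqueness = fa$uniquenesses), 2)
```

In this simulation roughly half of each variable's variance is noise, so the communalities should come out near 0.5, while PCA still accounts for the full variance of every variable.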

There are a number of techniques for assessing how many factors to extract. Two of the most useful are parallel analysis and the Minimum Average Partial (MAP) criterion. Parallel analysis works by simulating random data sets with the same dimensions (number of observations and variables) as the data and extracting eigenvalues from them. Factors are retained as long as the eigenvalues of the real data are greater than the simulated ones; the point at which the simulated eigenvalues become greater is taken as the point at which the "correct" number of factors has been extracted. The MAP criterion uses a different approach, partialling successive components out of the correlation matrix and choosing the number that minimises the average squared partial correlation, and it can often be more accurate. Simulation studies have found these two methods to be among the most accurate. Both methods are available in the psych package through the fa.parallel and VSS functions.
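
The idea behind parallel analysis can be sketched in base R. This is a simplified illustration rather than the algorithm used by fa.parallel: it compares eigenvalues of the full correlation matrix with the mean eigenvalues of random normal data, and the simulated data set, the seed and the number of simulations (n_sim) are all assumptions made for the example.

```r
set.seed(42)
N <- 1000
factor1 <- rnorm(N)
factor2 <- rnorm(N)
mydat <- data.frame(
  x1 = rnorm(N) + factor1, x2 = rnorm(N) + factor1,
  x3 = rnorm(N) + factor2, x4 = rnorm(N) + factor2
)

# Eigenvalues of the observed correlation matrix
obs_eig <- eigen(cor(mydat), symmetric = TRUE, only.values = TRUE)$values

# Eigenvalues of random data with the same number of rows and columns,
# averaged over n_sim simulated data sets
n_sim <- 100
sim_eig <- replicate(n_sim, {
  noise <- matrix(rnorm(N * ncol(mydat)), nrow = N)
  eigen(cor(noise), symmetric = TRUE, only.values = TRUE)$values
})
mean_sim_eig <- rowMeans(sim_eig)

# Retain the leading components whose observed eigenvalue exceeds the simulated one
exceeds <- obs_eig > mean_sim_eig
n_retain <- if (all(exceeds)) length(obs_eig) else which(!exceeds)[1] - 1

round(rbind(observed = obs_eig, simulated = mean_sim_eig), 2)
n_retain
```

With the psych package installed, fa.parallel(mydat) performs a more complete version of this comparison and plots the result.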

Another issue in factor analysis is which rotation (if any) to choose. Essentially, a rotation transforms the factor loadings so that they are more easily interpretable. There are two major classes of rotations, orthogonal and oblique. Orthogonal rotations (e.g. varimax) assume that the factors are uncorrelated, while oblique rotations (e.g. promax) allow the factors to correlate but do not force them to. Oblique rotations are recommended by some (e.g. MacCallum et al. 1999) because an oblique rotation can still yield essentially uncorrelated factors when the data support that, whereas an orthogonal rotation can never reveal correlated factors.
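
As a minimal sketch, the two classes of rotation can be compared with factanal() from the stats package, which accepts rotation = "varimax" (orthogonal) or rotation = "promax" (oblique). The simulated data, in which the two latent factors are built to correlate at about 0.5, are an assumption made purely for illustration.

```r
set.seed(1)
N <- 1000
factor1 <- rnorm(N)
factor2 <- 0.5 * factor1 + sqrt(0.75) * rnorm(N)  # correlates about 0.5 with factor1

# Hypothetical data: three noisy indicators per latent factor
dat <- data.frame(
  x1 = factor1 + rnorm(N), x2 = factor1 + rnorm(N), x3 = factor1 + rnorm(N),
  x4 = factor2 + rnorm(N), x5 = factor2 + rnorm(N), x6 = factor2 + rnorm(N)
)

fa_orth <- factanal(dat, factors = 2, rotation = "varimax")  # orthogonal rotation
fa_obl  <- factanal(dat, factors = 2, rotation = "promax")   # oblique rotation

# Loadings below the cutoff are left blank to make the pattern easier to read
print(fa_orth$loadings, cutoff = 0.3)
print(fa_obl$loadings, cutoff = 0.3)

# Rotation changes the loadings but not the fit: the uniquenesses are identical
all.equal(fa_orth$uniquenesses, fa_obl$uniquenesses)
```

Because the rotation is applied after the model has been fitted, both solutions explain the same amount of variance and differ only in how the loadings are distributed across the factors.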

One of the issues surrounding factor analysis is that an infinite number of rotations explain the same amount of variance, so it can be difficult to assess which model is correct. In response to such concerns, Confirmatory Factor Analysis (CFA), a special case of the more general Structural Equation Modelling (SEM) framework, was developed largely by Karl Jöreskog in the 1970s. The essential principle of SEM is that, given a model, it attempts to reproduce the covariance matrix observed in the data. How well the model reproduces that matrix can be used as a test of the model's fit. SEM is implemented in R in the sem, lavaan and OpenMx packages.
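
The principle of reproducing the observed covariance matrix can be illustrated without any SEM package. The sketch below uses the exploratory factanal() function rather than a confirmatory model, so it is only an analogy for what sem or lavaan do with a user-specified model; the simulated data and variable names are assumptions. The implied correlation matrix of a factor model is the loadings times their transpose plus the diagonal matrix of uniquenesses.

```r
set.seed(2)
N <- 1000
f1 <- rnorm(N)
f2 <- rnorm(N)
dat <- data.frame(
  x1 = f1 + rnorm(N), x2 = f1 + rnorm(N), x3 = f1 + rnorm(N),
  x4 = f2 + rnorm(N), x5 = f2 + rnorm(N), x6 = f2 + rnorm(N)
)

fit <- factanal(dat, factors = 2)

Lambda <- unclass(fit$loadings)   # 6 x 2 matrix of loadings
Psi <- diag(fit$uniquenesses)     # diagonal matrix of unique variances

implied <- Lambda %*% t(Lambda) + Psi   # correlation matrix implied by the model
observed <- cor(dat)                    # correlation matrix seen in the data

resid <- observed - implied
max_resid <- max(abs(resid))
max_resid

# Likelihood ratio test that two factors are sufficient
fit$STATISTIC
fit$PVAL
```

Small residuals and a non-significant test statistic indicate that the model reproduces the observed matrix well, which is expected here because the data were simulated from a two-factor model.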

See the following packages: FactoMineR, amap, ade4, anacor, vegan and psych.

Principal Component Analysis (PCA)

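PCA deals with continuous variables. It finds uncorrelated linear combinations of the variables (the principal components), ordered by the amount of variance they explain.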

  • prcomp() in the stats package.
  • princomp() in the stats package.
  • PCA() (FactoMineR)
  • See also factanal()
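  • See also fa() and principal() in the psych package

# Simulate four variables driven by two independent latent factors
set.seed(123) # for reproducibility
N <- 1000
factor1 <- rnorm(N)
factor2 <- rnorm(N)
x1 <- rnorm(N) + factor1
x2 <- rnorm(N) + factor1
x3 <- rnorm(N) + factor2
x4 <- rnorm(N) + factor2
mydat <- data.frame(x1, x2, x3, x4)

pca <- prcomp(mydat) # PCA via singular value decomposition
names(pca) # elements of the result: sdev, rotation, center, scale, x
summary(pca) # proportion of variance explained by each component
plot(pca) # plot the component variances (eigenvalues)
biplot(pca) # A two dimensional plot of observations and variables

pca2 <- princomp(mydat) # PCA via eigen decomposition of the covariance matrix
biplot(pca2)

pca2 <- princomp(~ x1 + x2 + x3 + x4, data = mydat) # princomp with a formula syntax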
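
The link between these functions and the eigenvalues of the covariance matrix can be checked directly. The sketch below re-simulates the same kind of data (the seed is arbitrary) and shows that the squared standard deviations returned by prcomp() equal the eigenvalues of cov(mydat), while princomp() uses the divisor N instead of N - 1 and therefore returns slightly smaller values.

```r
set.seed(123)
N <- 1000
factor1 <- rnorm(N)
factor2 <- rnorm(N)
mydat <- data.frame(
  x1 = rnorm(N) + factor1, x2 = rnorm(N) + factor1,
  x3 = rnorm(N) + factor2, x4 = rnorm(N) + factor2
)

# Eigenvalues of the sample covariance matrix (divisor N - 1)
eig <- eigen(cov(mydat), symmetric = TRUE)$values

pca <- prcomp(mydat)
pca2 <- princomp(mydat)

# prcomp: component variances are exactly these eigenvalues
all.equal(pca$sdev^2, eig)

# princomp: same eigenvalues rescaled by (N - 1) / N
all.equal(unname(pca2$sdev^2), eig * (N - 1) / N)

# Cumulative proportion of variance explained by the components
prop_var <- eig / sum(eig)
round(cumsum(prop_var), 3)
```

With N = 1000 the difference between the two functions is negligible, but it can matter for small samples.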


Correspondence Analysis (CA)

Correspondence analysis is a tool for analyzing contingency tables.

  • corresp() in the MASS package
  • Michael Greenacre's ca package (JSS article)
  • Correspondence Analysis and Related Methods Network
  • Quick-R's page on correspondence analysis
  • Simple and Canonical Correspondence Analysis Using the R Package anacor (JSS article)
  • The multiv package
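
As a rough sketch of what these functions compute, a simple correspondence analysis can be carried out in base R with a singular value decomposition of the standardised residuals of the contingency table. The choice of the built-in HairEyeColor data (collapsed to a hair by eye colour table) is an assumption made for illustration; in practice corresp() or the ca package would normally be used.

```r
# Contingency table of hair colour by eye colour (summed over sex)
tab <- margin.table(HairEyeColor, c(1, 2))

P <- unclass(tab) / sum(tab)            # correspondence matrix (proportions)
row_mass <- rowSums(P)                  # row margins
col_mass <- colSums(P)                  # column margins
expected <- outer(row_mass, col_mass)   # proportions expected under independence

# Standardised residuals and their singular value decomposition
S <- (P - expected) / sqrt(expected)
sv <- svd(S)

# Principal inertias; their sum equals the chi-squared statistic divided by n
inertia <- sv$d^2
total_inertia <- sum(inertia)
chi2 <- unname(chisq.test(tab)$statistic)
all.equal(total_inertia, chi2 / sum(tab))

# Share of inertia carried by each dimension
round(inertia / total_inertia, 3)

# Principal coordinates of the rows on the first two dimensions
row_coords <- sweep(sv$u[, 1:2], 1, sqrt(row_mass), "/") %*% diag(sv$d[1:2])
rownames(row_coords) <- rownames(P)
row_coords
```

The row coordinates can be plotted against each other to obtain the usual correspondence analysis map of the row categories.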

Multiple Correspondence Analysis (MCA)

References

