R Programming/Linear Models


Standard linear model

In this section we present estimation functions for the standard linear model estimated by ordinary least squares (OLS). Heteroskedasticity and endogeneity are treated below. The main estimation function is lm().

Fake data simulations

We first generate a fake dataset with no heteroskedasticity, no endogeneity and no correlation between the error terms. The ordinary least squares estimator is therefore unbiased and efficient. We choose a model with two explanatory variables and set all coefficients equal to one. Since no random seed is set, the numerical output shown below will differ from run to run.

> N <- 1000
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- 1 + x1 + rnorm(N)
> y <- 1 + x1 + x2 + u
> df <- data.frame(y,x1,x2)

Least squares estimation

  • The standard function to estimate a simple linear model is lm().
  • lsfit() also performs least squares estimation, but its output is not formatted in a user-friendly way.
  • ols() in the rms package (formerly Design) is another alternative.

We estimate the model using lm(), store the results in fit and display them with the standard summary() function.

> fit <- lm(y ~ x1 + x2, data = df)
> summary(fit)

There are alternative ways to display the results.

  • display() in the arm package is one of them.
  • coefplot() (arm) graphs the estimated coefficients with confidence intervals. This is a good way to present the results.
  • mtable() in the memisc package can display the results of a set of regressions in the same table.
> library("arm")
> display(fit)
> coefplot(fit)

fit is a list. You can see its components by typing names(fit). We can also apply extractor functions to fit.

We can get the estimated coefficients using fit$coeff or coef(fit).

> fit$coeff
(Intercept)          x1          x2 
  1.1945847   0.6458170   0.6175165 
> coef(fit)
(Intercept)          x1          x2 
  1.1945847   0.6458170   0.6175165 
> output <- summary(fit)
> coef(output) 
             Estimate Std. Error  t value    Pr(>|t|)
(Intercept) 1.1945847  0.2298888 5.196359 0.001258035
x1          0.6458170  0.3423214 1.886581 0.101182585
x2          0.6175165  0.2083628 2.963660 0.020995713

se.coef() (arm) returns the standard errors of the estimated coefficients.
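For example, with arm loaded as above:

> se.coef(fit)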

The vector of fitted values can be obtained via fit$fitted, fitted(fit) or the predict() function. The predict() function can also return standard errors and confidence intervals for predictions.

 
> fit$fitted
> fitted(fit)
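As a minimal sketch, predict() can produce interval estimates at new covariate values (the values x1 = 0 and x2 = 1 are arbitrary):

> new <- data.frame(x1 = 0, x2 = 1)
> predict(fit, newdata = new, interval = "confidence", level = 0.95)
> predict(fit, newdata = new, interval = "prediction", level = 0.95)
> predict(fit, se.fit = TRUE)$se.fit # standard errors of the fitted values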

The vector of residuals:

> fit$resid
> residuals(fit)

The residual degrees of freedom:

> fit$df

Confidence intervals

We can get the confidence intervals using confint() or conf.intervals() in the alr3 package.

> confint(fit, level = .9)
                   5 %     95 %
(Intercept) -0.7263261 1.200079
x1          -0.5724022 1.909924
x2           0.6185011 2.475079
> confint(fit, level = .95)
                 2.5 %   97.5 %
(Intercept) -0.9652970 1.439050
x1          -0.8803353 2.217858
x2           0.3881923 2.705388
> confint(fit, level = .99)
                 0.5 %   99.5 %
(Intercept) -1.5422587 2.016012
x1          -1.6237963 2.961319
x2          -0.1678559 3.261436
> library(alr3)
> conf.intervals(fit)
                 2.5 %   97.5 %
(Intercept) -0.9652970 1.439050
x1          -0.8803353 2.217858
x2           0.3881923 2.705388

Tests

coeftest() (lmtest) performs t tests and z tests on the estimated coefficients.

> library("lmtest")
> coeftest(fit) # t-test
> coeftest(fit,df=Inf) # z-test (for large samples)

linear.hypothesis() (car) performs a finite-sample F test or an asymptotic Wald test (chi-squared statistic) of a linear hypothesis. In recent versions of car the function is called linearHypothesis().

> library("car")
> linear.hypothesis(fit,"x1 = x2") # tests Beta1 = Beta2
> linear.hypothesis(fit,c("(Intercept)", "x1","x2"),rep(1,3)) # Tests  Beta0 = Beta1 = Beta2 = 1
> linear.hypothesis(fit,c("(Intercept)", "x1","x2"),rep(0,3)) # Tests  Beta0 = Beta1 = Beta2 = 0
> linear.hypothesis(fit,c("x1","x2"),rep(0,2)) # Tests Beta1 = Beta2 = 0

See also waldtest() (lmtest) for nested models.
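For instance, a minimal sketch comparing the full model with a restricted model that drops x2:

> fit0 <- lm(y ~ x1, data = df)
> waldtest(fit0, fit) # tests Beta2 = 0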

Analysis of variance

We can also perform an analysis of variance using anova().

> anova(fit)
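anova() can also compare nested models with an F test, for example:

> anova(lm(y ~ x1, data = df), fit)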

Model search and information criteria

> # Akaike Information Criterion
> AIC(fit)
[1] 26.72857
> # Bayesian Information Criterion
> AIC(fit,k=log(N))
[1] 27.93891

The stats4 package also provides AIC() and BIC() functions:

> library(stats4)
> ?BIC
> lm1 <- lm(Fertility ~ . , data = swiss)
> AIC(lm1)
[1] 326.0716
> BIC(lm1)
[1] 339.0226

The step() function performs a stepwise model search using the Akaike Information Criterion.

> N <- 10^3
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1
> x3 <- rnorm(N)
> y <- 1 + x1 + x2 + u
> fit <- lm(y ~ x1 + x2 + x3)
> step.fit <- step(fit)

Zelig

  • Least squares estimation is also supported by the Zelig package (model = "ls").
> library("Zelig")
> N <- 1000
> u <- rnorm(N)
> x <- rnorm(N)
> y <- 1 + x + u
> mydat <- data.frame(y,x)
> z.out <- zelig(y ~  x, model = "ls", data = mydat)
> x.out <- setx(z.out, x = 10)
> s.out <- sim(z.out, x.out)
> summary(s.out)

Bayesian estimation

  • MCMCregress() (MCMCpack)
  • BLR() (BLR)
> N <- 1000
> u <- rnorm(N)
> x <- rnorm(N)
> y <- 1 + x + u
> mydat <- data.frame(y,x)
> library("MCMCpack")
> posterior <- MCMCregress(y ~ x, data = mydat)
> summary(posterior)
> plot(posterior)

Heteroskedasticity

  • See the lmtest and sandwich packages (a robust standard error sketch follows the example below).
  • gls() (nlme) computes the generalized least squares estimator.
  • See "Cluster-robust standard errors using R" (pdf) by Mahmood Arai. He suggests two functions for cluster robust standard errors. clx() allow for one-way clustering and mclx() for two-way clustering. They can be loaded with the following command source("http://people.su.se/~ma/clmclx.R").
> N <- 10 # 10 people
> T <- 5 # 5 times
> id <- rep(1:N,T)
> f <- rep(rnorm(N), T) # individual-specific effect
> u <- rnorm(N*T)
> x1 <- rnorm(N*T) 
> x2 <- rnorm(N*T) + x1
> y <- 1 + x1 + x2 + f + u
> fit <- lm(y ~ x1 + x2)
> source("http://people.su.se/~ma/clmclx.R")
> clx(fit, 1, id)
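As mentioned above, heteroskedasticity-robust (White) standard errors can be obtained by combining coeftest() (lmtest) with a vcovHC() covariance matrix from the sandwich package. A minimal sketch on the fit just estimated:

> library("lmtest")
> library("sandwich")
> coeftest(fit, vcov = vcovHC(fit, type = "HC1"))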

Robustness

Cook's distance can be computed with cookd() (car); base R provides the equivalent cooks.distance(). Note that recent versions of car rename the functions used below in camel case: influencePlot(), leveragePlots() and outlierTest().

> library(car)
> cookd(fit)
           1            2            3            4            5 
0.0006205008 0.0643213760 0.2574810866 1.2128206779 0.2295047699 
           6            7            8            9           10 
0.3130578329 0.0003365221 0.0671830241 0.0048474954 0.0714255871

Influence plot:

> influence.plot(fit)

Leverage plots:

> leverage.plot(fit, term.name = "x1")
> leverage.plot(fit, term.name = "x2")

Bonferroni's outlier test:

> outlier.test(fit)

max|rstudent| = 2.907674, degrees of freedom = 6,
unadjusted p = 0.02706231, Bonferroni p = 0.2706231

Observation: 3

See also outlier.t.test() in the alr3 package.

  • inf.index() in the alr3 package computes all the robustness statistics (Cook's distance, studentized residuals, outlier test, etc.).
  • rlm() in the MASS package performs robust regression (a minimal sketch follows).
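A minimal sketch on the df simulated at the beginning of this page:

> library("MASS")
> rfit <- rlm(y ~ x1 + x2, data = df)
> summary(rfit)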

Instrumental Variables

  • ivreg() in the AER package[1]
  • tsls() in the sem package.
  • It is also possible to use the gmm() command in the gmm package. See Methods of moments for an example.

Fake data simulations

We first simulate a fake dataset in which x is correlated with the error term u, z is independent of u, and z is correlated with x. Thus x is an endogenous explanatory variable of y and z is a valid instrument for x.

> N <- 1000
> z <- rnorm(N)
> u <- rnorm(N) 
> x <- 1 + z + u + rnorm(N) # x is correlated with the error term u (endogeneity) and the instrument z
> y <- 1 + x + u

Two stage least squares

Then we estimate the model with OLS (lm()) and IV using z as an instrument for x.

> ols <- lm(y ~ x)
> summary(ols) # OLS estimates are biased
> library("AER")
> iv <- ivreg(y ~ x | z)
> summary(iv) # IV estimates are unbiased
> library("sem")
> iv2 <- tsls(y  ~ x, instruments = ~ z)
> summary(iv2)
> library("gmm")
> iv3 <- gmm(y ~ x, z)
> summary(iv3)

We plot the results:

> plot(y ~ x, col = "gray")
> abline(a = 1, b = 1, lty = 1, col = 1, lwd = 2)
> abline(ols,  lty = 2, col = 2 , lwd = 2)
> abline(iv, lty = 3, col = 3, lwd = 2)
> legend("topleft", legend = c("True values","OLS","IV"), col = 1:3, lwd = rep(2,3), lty = 1:3)

Panel Data

plm() (plm) implements the standard random effects, fixed effects and first-differences estimators[2]. It is similar to Stata's xtreg command.

Note that plm output is not compatible with xtable() and mtable() for publication-quality output.

  • lme4 and gee implement random effects and multilevel models.
  • See also BayesPanel

Random effects model

To illustrate the random effects model we generate a fake dataset with 1000 individuals observed over 5 time periods.

> N <- 1000
> T <- 5
> library(mvtnorm)
> sig <- diag(rep(1,T))
> r <- rmvnorm(N, sigma = sig)
> wide <- data.frame(id = 1:N,f = rnorm(N), u = r)
> long <- reshape(wide, varying = list(3:7), v.names = "u", direction = "long", timevar = "year")
> long$x1 <- 1 + rnorm(N*T) 
> long$x2 <- 1 + rnorm(N*T) + long$x1
> long$y <- 1 + long$x1 + long$x2 + long$f + long$u
> head(long[order(long$id),])

We estimate the random effect model with the plm() function and the model = "random" option.

> library("plm")
> panel <- plm.data(long, index = c("id","year"))
> # panel <- pdata.frame(long,c("id","year"))
> eq <- y ~ x1 + x2
> re <- plm(eq, model = "random", data=panel)
> summary(re)

Fixed effects model

For a fixed effects model we generate a fake dataset in which the fixed effects f are correlated with the covariates:

> N <- 1000
> T <- 5
> library(mvtnorm)
> sig <- diag(rep(1,T))
> r <- rmvnorm(N, sigma = sig)
> wide <- data.frame(id = 1:N,f = rnorm(N), u = r)
> long <- reshape(wide, varying = list(3:7), v.names = "u", direction = "long", timevar = "year")
> long$x1 <- 1 + rnorm(N*T) + long$f
> long$x2 <- 1 + rnorm(N*T) + long$x1
> long$y <- 1 + long$x1 + long$x2 + long$f + long$u
> head(long[order(long$id),])

We first transform our data into a plm data frame using plm.data(). We estimate the fixed effects model using plm() with the model = "within" option. Then we compare the estimates with the random effects model and perform a Hausman test. Finally, we plot the density of the estimated fixed effects.

> library("plm")
> panel <- plm.data(long, index = c("id","year"))
> #panel <- pdata.frame(long,c("id","year"))
> eq <- y ~ x1 + x2
> fe <- plm(eq, model = "within", data=panel)
> summary(fe)
> re <- plm(eq, model = "random", data=panel)
> summary(re)
> phtest(fe, re)
> plot(density(fixef(fe)))
> rug(fixef(fe))

Dynamic panel data

  • pgmm() (plm) implements the Arellano-Bond estimation procedure[3]. It is similar to xtabond2 in Stata[4]. A minimal sketch follows.
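A minimal sketch, adapted from the plm documentation's employment equation on the EmplUK dataset shipped with the package (the formula is illustrative, not a recommended specification):

> library("plm")
> data("EmplUK", package = "plm")
> # employment on its own lags and wages, instrumented by deeper lags of employment
> ab <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1) | lag(log(emp), 2:99),
+            data = EmplUK, effect = "twoways", model = "twosteps")
> summary(ab)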

Simultaneous equations model

For a simultaneous equations model the following packages can be used (a minimal systemfit sketch follows the list):

  • sem package
  • systemfit package
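A minimal sketch of a two-equation system estimated by seemingly unrelated regressions with systemfit (the simulated data and equation names are illustrative):

> library("systemfit")
> N <- 1000
> x1 <- rnorm(N); x2 <- rnorm(N)
> y1 <- 1 + x1 + rnorm(N)
> y2 <- 1 + x2 + rnorm(N)
> dat <- data.frame(y1, y2, x1, x2)
> sur <- systemfit(list(first = y1 ~ x1, second = y2 ~ x2), method = "SUR", data = dat)
> summary(sur)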

References

  1. Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL http://CRAN.R-project.org/package=AER
  2. Yves Croissant, Giovanni Millo (2008). Panel Data Econometrics in R: The plm Package. Journal of Statistical Software 27(2). URL http://www.jstatsoft.org/v27/i02/.
  3. Manuel Arellano and Stephen Bond (1991). Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations. The Review of Economic Studies 58(2), 277-297.
  4. David Roodman. XTABOND2: Stata module to extend xtabond dynamic panel data estimator. URL http://ideas.repec.org/c/boc/bocode/s435901.html

Previous: Descriptive Statistics | Index | Next: Maximum Likelihood