R Programming/Binomial Models

From Wikibooks, open books for an open world
Jump to navigation Jump to search

In this section, we look at the binomial model. We have one outcome which is binary and a set of explanatory variables.

This kind of model can be analyzed using a linear probability model. However a drawback of this model for the parameter of the Bernoulli distribution is that, unless restrictions are placed on , the estimated coefficients can imply probabilities outside the unit interval . For this reason, models such as the logit model or the probit model are more commonly used. If you want to estimate a linear probability model, have a look at the linear models page.

Logit model[edit | edit source]

The model takes the form : with the inverse link function : . It can be estimated using maximum likelihood or using bayesian methods.

Fake data simulations[edit | edit source]

> x <- 1 + rnorm(1000,1) 
> xbeta <- -1  + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
> df <- data.frame(y,x)

Maximum likelihood estimation[edit | edit source]

  • The standard way to estimate a logit model is glm() function with family binomial and link logit.
  • lrm() (Design) is another implementation of the logistic regression model.
  • There is an implementation in the Zelig package[1].

In this example, we simulate a model with one continuous predictor and estimate this model using the glm() function.

> res <- glm(y ~ x , family  = binomial(link=logit))
> summary(res) # results
> confint(res) # confindence intervals
> names(res) 
> exp(res$coefficients) # odds ratio
> exp(confint(res)) # Confidence intervals for odds ratio (delta method)
> predict(res) # prediction on a linear scale
> predict(res, type = "response") # predicted probabilities
> plot(x, predict(res, type = "response")) # plot the predicted probabilities

Zelig[edit | edit source]

The Zelig' package makes it easy to compute all the quantities of interest.

We develop a new example. First we simulate a new dataset with two continuous explanatory variables and we estimate the model using zelig() with the model = "logit" option.

  • We the look at the predicted values of y at the mean of x1 and x2
  • Then we look at the predicted values when x1 = 0 and x2 = 0
  • We also look at what happens when x1 changes from the 3rd to the 1st quartile.
> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1  + x1 + x2
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)
> 
> z.out <- zelig(y ~ x1 + x2, model = "logit", data = mydat) # estimating the model
> summary(z.out)
> x.out <- setx(z.out, x1 = mean(x1), x2 = mean(x2)) # setting values for the explanatory variables
> s.out <- sim(z.out, x = x.out) # simulating the quantities of interest
> summary(s.out)
> plot(s.out) # plot the quantities of interest

> # the same with other values
> x.out <- setx(z.out, x1 = 0, x2 = 0)
> s.out <- sim(z.out, x = x.out)
> summary(s.out)

> # What happens if x1 change from the 3rd quartile to the 1st quartile ? 
> x.high <- setx(z.out, x1 = quantile(mydat$x1,.75), x2 = mean(mydat$x2)) 
> x.low <- setx(z.out, x1 = quantile(mydat$x1,.25), x2 = mean(x2)) 
> s.out2<-sim(z.out, x=x.high, x1=x.low) 
> plot(s.out2)
  • ROC Curve in the verification package.
  • Zelig has a rocplot() function.

Bayesian estimation[edit | edit source]

  • bayesglm() in the arm package
  • MCMClogit() in the MCMCpack for a bayesian estimation of the logit model.
> # Data generating process
> x <- 1 + rnorm(1000,1) 
> xbeta <- -1  + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
> 
> library(MCMCpack)
> res <- MCMClogit(y ~ x)
> summary(res)

> library("arm")
> res <- bayesglm(y ~ x, family = binomial(link=logit))
> summary(res)

Probit model[edit | edit source]

The probit model is a binary model in which we assume that the link function is the cumulative density function of a normal distribution.

We simulate fake data. First, we draw two random variables x1 and x2 in any distributions (this does not matter). Then we create the vector xbeta as a linear combination of x1 and x2. We apply the link function to that vector and we draw the binary variable y as Bernouilli random variable.

> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1  + x1 + x2
> proba <- pnorm(xbeta)
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)

Maximum likelihood[edit | edit source]

We can use the glm() function with family=binomial(link=probit) option or the probit() function in the sampleSelection package which is a wrapper of the former one.

> res <- glm(y ~ x1 + x2 , family = binomial(link=probit), data = mydat)
> summary(res)
> 
> library("sampleSelection")
> probit(y ~ x1 + x2, data = mydat)
> summary(res)

Bayesian estimation[edit | edit source]

  • MCMCprobit() (MCMCpack)
> library("MCMCpack")
> post <- MCMCprobit(y ~ x1 + x2 , data = mydat)
> summary(post)
> plot(post)

See Also[edit | edit source]

  • There is an example of a probit model with R on the UCLA statistical computing website[2].

Semi-Parametric models[edit | edit source]

References[edit | edit source]

  1. Kosuke Imai, Gary King, and Oliva Lau. 2008. "logit: Logistic Regression for Dichotomous Dependent Variables" in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software," http://gking.harvard.edu/zelig
  2. UCLA statistical computing probit example http://www.ats.ucla.edu/stat/R/dae/probit.htm
  3. Klein, R. W. and R. H. Spady (1993), “An efficient semiparametric estimator for binary response models,” Econometrica, 61, 387-421.
  4. Tristen Hayfield and Jeffrey S. Racine (2008). Nonparametric Econometrics: The np Package. Journal of Statistical Software 27(5). URL http://www.jstatsoft.org/v27/i05/.
Previous: Quantile Regression Index Next: Multinomial Ordered Models