R Programming/Print version



R Programming

The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/R_Programming

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-ShareAlike 3.0 License.

Introduction

What is R?

R is statistical software used for data analysis. It includes a huge number of statistical procedures such as t-tests, chi-square tests, standard linear models, instrumental variables estimation, local polynomial regressions, etc. It also provides high-level graphics capabilities. R's syntax has a few superficial similarities with C, but R is an interpreted language designed for interactive data analysis, whereas C is compiled.

Why use R?

  • R is free software. R is an official GNU project and is distributed under the GNU General Public License (GPL).
  • R is a powerful data-analysis package with many standard and cutting-edge statistical functions. See the Comprehensive R Archive Network (CRAN)'s Task Views to get an idea of what you can do with R.
  • R is a programming language, so its abilities can easily be extended through the use of user-defined functions. A large collection of user-contributed functions and packages can be found in CRAN's Contributed Packages.
  • R is widely used in political science, statistics, econometrics, actuarial sciences, sociology, finance, etc.
  • R is available for all major operating systems (Windows, Mac OS, GNU/Linux).
  • R is object-oriented. Virtually anything (e.g., complex data structures) can be stored as an R object.
  • R is a matrix language.
  • R syntax is much more systematic than Stata or SAS syntax.
  • R can be installed on your USB stick[1].

Alternatives to R

  • S-PLUS is a commercial version of the same S programming language that R is a free version of.
  • Gretl is free software for econometrics. It has a graphical user interface and is nice for beginners.
  • SPSS is proprietary software which is often used in sociology, psychology and marketing. It is known to be easy to use.
  • GNU PSPP is a free-software alternative to SPSS.
  • SAS is proprietary software that can be used with very large datasets such as census data.
  • Stata is proprietary software that is often used in economics and epidemiology.
  • Julia is a general programming language with capabilities similar to MATLAB, R and Python, and performance that can approach that of C. It can call libraries from all of those languages.
  • MATLAB is proprietary software used widely in the mathematical sciences and engineering.
  • Octave is free software similar to MATLAB. The syntax is largely compatible, so most MATLAB code can be used in Octave.
  • Python is a general programming language. It includes some specific libraries for data analysis such as pandas[2][3].

Beginners can have a look at GNU PSPP or Gretl. Intermediate users can check out Stata. Advanced users who like matrix programming may prefer MATLAB or Octave. Very advanced users may use C or Fortran.

R programming style

  • R is an object-oriented programming language. This means that virtually everything can be stored as an R object. Each object has a class. This class describes what the object contains and what each function does with it. For instance, plot(x) produces different outputs depending on whether x is a regression object or a vector.
  • The assignment symbol is "<-". Alternatively, the classical "=" symbol can be used.

The two following statements are equivalent:

 > a <- 2
 > a = 2
  • Arguments are passed to functions inside round brackets (parentheses).
  • One can easily combine functions. For instance you can directly type
mean(rnorm(1000)^2)
  • The symbol "#" comments to the end of the line:
 # This is a comment
 5 + 7 # This is also a comment
  • Commands are normally separated by a newline. If you want to put more than one statement on a line, you can use the ";" delimiter.
 a <- 1:10 ; mean(a)
  • You can also have one statement on multiple lines.
  • R is case sensitive: a and A are two different objects.
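  • Traditionally underscores "_" were not used in names, and dots "." were preferred. Underscores are now allowed inside names, but an object name cannot begin with an underscore.
  • You can also use the native pipe operator |> (available since R 4.1.0). It passes the left-hand side as the first argument of the function call on the right-hand side.
 1:10 |> mean()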
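To illustrate how the class of an object decides what a function does with it, here is a minimal sketch. The generic describe() and its two methods are hypothetical names made up for this example; they are not part of R. The cars dataset ships with R.

```r
# Hypothetical generic: dispatches on the class of its first argument
describe <- function(x, ...) UseMethod("describe")

# Fallback method, used when no more specific method exists
describe.default <- function(x, ...) {
  paste("an object of class", class(x)[1], "with", length(x), "elements")
}

# Method used for fitted linear models (objects of class "lm")
describe.lm <- function(x, ...) {
  paste("a linear model with", length(coef(x)), "coefficients")
}

v <- c(1.5, 2.5, 4)
fit <- lm(dist ~ speed, data = cars)

describe(v)    # handled by describe.default
describe(fit)  # handled by describe.lm
```

The same mechanism is what lets plot(), print() or summary() adapt to the object they receive.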

How you can help

Here are some things editors do to keep this book internally consistent. If you have something to contribute, go ahead and make your contribution. Other editors can touch up your edits afterwards so that they conform to the guidelines.

The local manual of style WB:LMOS for the R programming book, including a brief explanation of why we do it that way, is:

  • Examples use "source" tags : <syntaxhighlight lang="rsplus"> a <- 1:10 ; mean(a) </syntaxhighlight>. That makes them look pretty to our readers.
  • Package names are in bold: '''Hmisc'''.
  • Function names are in "code" tags: <code>lm()</code>.
  • Page titles -- the part after "R Programming/" -- are in sentence case, like "R Programming/Working with data frames". We couldn't decide between sentence case and title case, so I flipped a coin.
  • Every page has <noinclude>{{R Programming/Navigation}}</noinclude> at the top and {{R Programming/Navbar|Mathematics|Probability Distributions}} at the bottom. That makes it easier to navigate from one page to another online.

References

  1. Portable R by Andrew Redd http://sourceforge.net/projects/rportable/
  2. "Python Data Analysis Library". pandas.pydata.org/. Retrieved February 14, 2013.
  3. "Getting started with Pandas". blog.kaggle.com. January 17, 2013. Retrieved February 14, 2013.


Sample Session

This page is an introduction to the R programming language. It shows how to perform very simple tasks using R. First you need to have R installed (see the Settings page). If you use Windows or Mac OS, the easiest solution is to use the R Graphical User Interface (click on its icon). If you use Linux, open a terminal and type R at the command prompt.

Usually when you open R, you see a message similar to the following in the console:

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Workspace loaded from ~/.RData]

>

You can type your code after the angle bracket >.

R can be used as a simple calculator and we can perform any simple computation.

 
> # Sample Session 
> # This is a comment
> 
> 2 # print a number
[1] 2
> 2+3 # perform a simple calculation
[1] 5
> log(2) # natural log
[1] 0.6931472

We can also store numeric or string objects using the assignment operator, <-.

> x <- 2 # store an object
> x # print this object
[1] 2
> (x <- 3) # store and print an object
[1] 3
> 
> x <- "Hello" # store a string object
> x
[1] "Hello"

We can also store vectors.

> Height <- c(168, 177, 177, 177, 178, 172, 165, 171, 178, 170) #store a vector
> Height  # print the vector
 [1] 168 177 177 177 178 172 165 171 178 170
> 
> Height[2] # Print the second component
[1] 177
> Height[2:5] # Print the second, the 3rd, the 4th and 5th component
[1] 177 177 177 178
> 
> (obs <- 1:10) # Define a vector as a sequence (1 to 10)
 [1]  1  2  3  4  5  6  7  8  9 10
> 
> Weight <- c(88, 72, 85, 52, 71, 69, 61, 61, 51, 75)
> 
> BMI <- Weight/((Height/100)^2)   # Performs a simple calculation using vectors
> BMI
 [1] 31.17914 22.98190 27.13141 16.59804 22.40879 23.32342 22.40588 20.86112
 [9] 16.09645 25.95156

We can also describe the vector with length(), mean() and var().

> length(Height)
[1] 10
> mean(Height) # Compute the sample mean
[1] 173.3
> var(Height)
[1] 22.23333

We can also define a matrix.

> M <- cbind(obs,Height,Weight,BMI) # Create a matrix
> typeof(M) # Give the type of the matrix
[1] "double"
> class(M)  # Give the class of an object (R 4.0.0 and later print "matrix" "array")
[1] "matrix"
> is.matrix(M) # Check if   M is a matrix
[1] TRUE
> is.vector(M)  # M is not a vector
[1] FALSE
> dim(M)    # Dimensions of a matrix
[1] 10  4
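Once a matrix exists, its rows and columns can be selected with square brackets. The following sketch rebuilds a smaller matrix from the Height and Weight vectors used above and shows a few common selections.

```r
Height <- c(168, 177, 177, 177, 178, 172, 165, 171, 178, 170)
Weight <- c(88, 72, 85, 52, 71, 69, 61, 61, 51, 75)
M <- cbind(Height, Weight)         # columns are named after the vectors

M[2, ]                             # second row (a named vector)
M[, "Weight"]                      # the Weight column, selected by name
tall <- M[M[, "Height"] > 175, ]   # rows where Height exceeds 175
tall
colMeans(M)                        # mean of each column
```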

We can plot the data using plot().

 
> plot(Height,Weight,ylab="Weight",xlab="Height",main="Corpulence")

We can define a dataframe.

 
> mydat <- data.frame(M) # Creates a dataframe
> names(mydat) # Give the names of each variable
[1] "obs"    "Height" "Weight" "BMI"   
> str(mydat)   # give the structure of your data
'data.frame':   10 obs. of  4 variables:
 $ obs   : num  1 2 3 4 5 6 7 8 9 10
 $ Height: num  168 177 177 177 178 172 165 171 178 170
 $ Weight: num  88 72 85 52 71 69 61 61 51 75
 $ BMI   : num  31.2 23 27.1 16.6 22.4 ...
> 
> View(mydat)  # Look at your data
> 
> summary(mydat)  # Descriptive Statistics
      obs            Height          Weight           BMI       
 Min.   : 1.00   Min.   :165.0   Min.   :51.00   Min.   :16.10  
 1st Qu.: 3.25   1st Qu.:170.2   1st Qu.:61.00   1st Qu.:21.25  
 Median : 5.50   Median :174.5   Median :70.00   Median :22.70  
 Mean   : 5.50   Mean   :173.3   Mean   :68.50   Mean   :22.89  
 3rd Qu.: 7.75   3rd Qu.:177.0   3rd Qu.:74.25   3rd Qu.:25.29  
 Max.   :10.00   Max.   :178.0   Max.   :88.00   Max.   :31.18  
>
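Columns of a dataframe are accessed with $, and rows can be filtered with a logical condition. This is a small sketch using the same heights and weights as above; the object name heavy is chosen only for illustration.

```r
mydat <- data.frame(
  Height = c(168, 177, 177, 177, 178, 172, 165, 171, 178, 170),
  Weight = c(88, 72, 85, 52, 71, 69, 61, 61, 51, 75)
)
mydat$BMI <- mydat$Weight / (mydat$Height / 100)^2  # add a new column

mydat$Height[1:3]                    # a column is an ordinary vector
heavy <- mydat[mydat$BMI > 25, ]     # keep the rows with a BMI above 25
nrow(heavy)
head(mydat[order(mydat$BMI), ], 3)   # the three rows with the lowest BMI
```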

You can save an R session (all the objects in memory) and load the session.

> save.image(file="~/Documents/Logiciels/R/test.rda")
> load("~/Documents/Logiciels/R/test.rda")
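If you only want to keep a few objects rather than the whole session, save() and load() work the same way. The sketch below writes to a temporary file (created with tempfile(), used here only so that the example does not depend on a particular folder).

```r
x <- 1:5
y <- "Hello"

tmp <- tempfile(fileext = ".rda")  # temporary path, for illustration only
save(x, y, file = tmp)             # write only x and y to disk

rm(x, y)                           # remove them from the workspace
loaded <- load(tmp)                # restore them; load() returns their names
loaded
x
unlink(tmp)                        # delete the temporary file
```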

We can define a working directory. Note for Windows users: R uses forward slashes ("/") in paths instead of backslashes ("\").

> setwd("~/Desktop")            # Sets working directory (character string enclosed in "...")
> getwd()                       # Returns current working directory
[1] "/Users/username/Desktop"
> dir()                         # Lists the content of the working directory

There are some special values in R:

  • NA : Not Available (i.e. missing values)
  • NaN : Not a Number (e.g. 0/0)
  • Inf: Infinity
  • -Inf : Minus Infinity.

For instance, 0 divided by 0 gives NaN but 1 divided by 0 gives Inf.

 > 0/0
 [1] NaN
 > 1/0
 [1] Inf
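These special values can be detected with is.na(), is.nan() and is.finite(). A short sketch on a vector that mixes ordinary numbers with special values:

```r
v <- c(1, NA, 3, 0/0, 1/0)

is.na(v)       # TRUE for both NA and NaN
is.nan(v)      # TRUE only for NaN
is.finite(v)   # FALSE for NA, NaN, Inf and -Inf

mean(v)                          # not a usable number: special values propagate
clean_mean <- mean(v[is.finite(v)])  # mean of the finite values only
clean_mean
```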

We can exit R using q(). The "no" argument specifies that the R session is not saved.

q("no")


Manage your workspace

This page explains how to manage your workspace.

Basic functions

  • ls() lists the objects in your workspace.
  • list.files() lists the files located in the current working directory.
  • rm() removes objects from your workspace; rm(list = ls()) removes them all.
rm(list=ls()) # remove all the objects in the workspace

Each object can be saved to the disk using the save() function. They can then be loaded into memory using load().

load("file.Rda")
...
# assume you want to save an object called 'df'
save(df, file = "file.Rda")
  • save.image() saves your workspace.

Information about the session

  • sessionInfo() gives information about your session, i.e., loaded packages, R version, etc.
  • R.version provides information about the R version.
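The version information can also be used inside a script, for example to check that R is recent enough. A minimal sketch (the version number "3.5.0" is an arbitrary threshold chosen for illustration):

```r
R.version.string            # a one-line description of the running R version
getRversion()               # the version as an object that can be compared
getRversion() >= "3.5.0"    # TRUE if the running R is at least 3.5.0
R.version$platform          # platform the R binary was built for

pkgs <- loadedNamespaces()  # names of the namespaces currently loaded
"base" %in% pkgs
```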

Memory usage

Note: memory.size() and memory.limit() are Windows-specific and have no effect on Linux and Mac (checked with R version 3.5.1). In R 4.2.0 and later they are no longer supported on Windows either.

memory.size() gives the total amount of memory currently used by R.

> memory.size()
[1] 10.18

memory.limit() without any argument gives the limit of memory used by R. This can also be used to increase the limit. The maximum amount is limited by the memory of the computer.

> memory.limit()
[1] 1535
>  memory.limit(size=2000) # 2000 stands for 2000 MB
[1] 2000

object.size() returns the size of an R object. You can print the result and choose the unit (bytes, kilobytes, megabytes, etc.).

> a <- rnorm(10^7)
> object.size(a)
80000024 bytes
> print(object.size(a),units="b")
80000024 bytes
> print(object.size(a),units="Kb")
78125 Kb
> print(object.size(a),units="Mb")
76.3 Mb
> print(object.size(a),units="Gb")
0.1 Gb
> print(object.size(a),units="auto")
76.3 Mb

memory.profile() returns more details.

> memory.profile()
       NULL      symbol    pairlist     closure environment     promise 
          1        4959       61794        1684         255        3808 
   language     special     builtin        char     logical     integer 
      14253          46         687        5577        2889        4060 
     double     complex   character         ...         any        list 
        523           1       11503           0           0        1024 
 expression    bytecode externalptr     weakref         raw          S4 
          1           0         497         117         118         642
  • gc() initiates the garbage collector which causes R to free memory from objects no longer used.
> gc()
           used (Mb) gc trigger  (Mb) max used (Mb)
Ncells  1095165 58.5    1770749  94.6  1770749 94.6
Vcells 12060564 92.1   17769683 135.6 12062095 92.1
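As a small illustration of how the storage type affects memory use, the sketch below compares an integer vector with the same values stored as doubles, then removes both and calls the garbage collector. The object names are chosen only for this example.

```r
ints <- rep(1L, 1000)           # integer vector
dbls <- as.numeric(ints)        # the same values stored as doubles

size_int <- object.size(ints)
size_dbl <- object.size(dbls)
size_int
size_dbl                        # doubles take about twice as much space
format(size_dbl, units = "Kb")

rm(ints, dbls)                  # drop the objects ...
invisible(gc())                 # ... and ask R to reclaim the memory
```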



Settings

This page shows how to install R, customize it and choose a working environment. Once you have installed R, you may want to choose a working environment. This can be a simple text editor (such as Emacs, Vim or Gedit), an integrated development environment (IDE) or a graphical user interface (GUI). RStudio is now a popular option.

Installation

Linux

Installing R on Debian-based GNU/Linux distributions (e.g. Ubuntu or Debian itself) is as simple as typing sudo aptitude install r-base or sudo apt-get install r-base (this has to be done as root, hence sudo), or installing the package r-base using your favourite package manager, for example Synaptic.

There is also a bunch of packages extending R to different purposes. Their names begin with r-. Take a closer look at the package r-recommended. It is a metapackage that depends on a set of packages that are recommended by the upstream R core team as part of a complete R distribution. It is possible to install R by installing just this package, as it depends on r-base.

Installation with apt-get (Debian, Ubuntu and all Linux distributions based on Debian)

sudo apt-get install r-base
sudo apt-get install r-recommended

Installation with aptitude (Debian, Ubuntu and all Linux distributions based on Debian)

sudo aptitude install r-base
sudo aptitude install r-recommended

Mac OS

Installation: Visit the R project website (http://r-project.org/), select the "CRAN" page and choose a mirror. Download the disk image (dmg file) and install R.

The default graphical user interface for Mac is much better than the one for Windows. It includes

  • a dataframe manager,
  • a history of all commands,
  • a program editor which supports syntax highlighting.

Windows

(Section source [1])

Download

To install R under the Windows operating system you have to download the binaries from the web. First go to r-project.org, click CRAN under the download section on the left panel and select a mirror site from which to download the required content. It is best to pick a mirror close to your actual geographical location, but other ones should work as well. Then click Windows and, in the subdirectories, base. The Windows binary is the exe file, in the form R-x.x.x-win32.exe, where x.x.x denotes the actual version of the program. Regardless of the version, the setup has the same steps.

Setup

As usual in Windows, if you just keep clicking the Next button, you will install the program without any problems. However, there are a few things that you can alter.

  1. On the welcome screen click Next.
  2. Read or just notice the GNU license, and click Next.
  3. Select the location, where R should be installed. In case you don't prefer a particular location on your hard disc, the default choice will be OK for you.
  4. During the next step you can specify which parts of R you want to install. Choices are: User installation, Minimal user installation, Full installation and Custom installation. Notice the required space under the selection panel (varies between 20 and 66 MB). In case you are a beginner in R, choose the default User installation.
  5. In this step you can choose between 2 ways. If you accept the defaults, you skip the 3 "extra" steps during installation (see below).
  6. You can specify the Start menu folder.
  7. In the next step you can choose, between shortcut possibilities (desktop icon and/or quick launch icon) and specify registry entries.


With these steps you can customize the R graphical user interface.

  • You can choose if you want an R graphic user interface covering the whole screen (MDI) or a smaller window (SDI).
  • You can select the style, how the Help screen is displayed in R. You will use help a lot, so this may be an important decision. It is up to you, which style you prefer. Please note, that the content of help file will be the same regardless of your choice. Here you specify just the appearance of that particular window.
  • In the next step you can specify, whether you want to use internet2.dll. If you are a beginner, pick the Standard option here.

Update

Updating R on Windows requires several steps:

  1. Downloading/installing the latest version of R
  2. Copying your packages from the library folder to the one in the new R installation

Both of these steps can easily be done using the installr package, by running the following command (which would both install the package, and update R) [2]:

# installing/loading the package:
if (!require(installr)) {
  install.packages("installr")
  require(installr)
} # load installr, installing it first if needed
updateR() # updates R

There is also the possibility of using a "global" package library, see here for more details.

Portable R for Windows

A portable version is available if you want to install R on your USB stick[3]. This is useful if you don't have admin rights on a computer. The basic installation requires something like 115 MB, but you may need more if you want to install add-on packages.

Working environment

Once you have installed R, you need to choose a working environment. In this section, we review the possible working environments. These include a basic terminal as well as integrated development environments (hereafter IDEs), text editors and graphical user interfaces (hereafter GUIs).

  • A graphical user interface provides some menu which makes it possible to run R without writing code. This is a good solution for beginners.
  • A text editor makes it easy to write code.
  • An integrated development environment provides a text editor and an R console, which makes it easy to write R scripts, to run them and to correct them.

Note that there are some task specific GUIs. For instance speedR provides a GUI to import data into R.

Terminal

[Figure: R in a Terminal window on Linux.]

For Linux and Mac OS users it is possible to use R from the terminal.

$ R
> q("no") # to leave R and return to the terminal

R Gui

For Mac OS and Windows users, there is a graphical user interface. In Mac OS, the GUI includes a package manager, a program editor with syntax highlighting and a data browser. In Windows, the GUI is not better than a Terminal.


Graphical User Interface

This section includes material for beginners (e.g. people who are not familiar with computing).

Poor Man's GUI (pmg)

A simple GUI for learning R. It is recommended for beginners.

> install.packages("pmg", dependencies=TRUE)
# Windows users may also run the following scripts to install required libraries
> source("http://www.math.csi.cuny.edu/pmg/installpmg.R")


> library(pmg)

Jaguar : Java GUI for R

  • Jaguar : Java GUI for R[4] is available for Linux, Mac and Windows (screenshots).
  • It is good for beginners.

R commander

  • Rcommander[5] developed by John Fox provides a menu in the standard Graphical User Interface (screenshots).
  • It works on Linux, Mac and Windows.
  • It is a good interface for beginners and for people who are not used to script editing.
> install.packages("Rcmdr") # installation
> library("Rcmdr") # usage
  • Ubuntu users can also install R Commander from the software center.


Integrated development environment

RStudio

[Figure: RStudio on Ubuntu 12.10.]

RStudio is an integrated development environment for R[6].

  • It works on Mac, Windows and Linux platforms.
  • It supports Sweave and LaTeX.
  • It includes syntax highlighting for R, LaTeX and Sweave.
  • It includes a way to view variables and dataframes.
  • It makes it easy to load and install packages, to navigate in the help files and to manage your workspace.
  • It supports code and file name completion.
  • It can be installed on a USB stick.

John Verzani has written a book dedicated to this new interface, Getting Started with RStudio[7], and Jeffrey Racine recommends RStudio for Sweave[8].

RKWard

RKWard is an IDE and a GUI for Linux (KDE) (Screenshots). RKWard aims to provide an easily extensible, easy to use IDE/GUI for R. RKWard tries to combine the power of the R-language with the (relative) ease of use of commercial statistics tools.

Eclipse with StatET

Eclipse with the StatET plugin[9] provides an IDE for R.

  • It supports Sweave.

Rattle GUI

Tinn R

  • For Windows only
  • Tinn R[12] is a good IDE for Windows users. One can easily define keyboard shortcuts to execute selected R code from Tinn R.

Notepad++ and NppToR

  • For Windows only.

Notepad++[13] and NppToR[14] provide syntax highlighting and hotkeys (by default F8) to send lines of code to R. Syntax highlighting can be easily modified using the dialog box to manage user-defined languages (Menu/View/User Define Dialog...). NppToR provides a method to generate syntax highlighting dynamically (depending on all the available packages in the R environment).

Vi, Vim and GVim

  • Vim and GVim provide syntax highlighting
  • Vim is for advanced users only
  • The Vim-R-plugin allows the communication between Vim and R

Emacs and ESS

  • Emacs with ESS (Emacs Speaks Statistics)[15].
  • For Linux users, you just have to install emacs and ESS using your standard package manager (synaptic, aptitude, yum, etc)
  • For Mac and Windows users, you can have a look at Vincent Goulet's page, which provides binaries with Emacs and ESS[16].
  • For Mac users, Aquamacs Emacs is a good solution. It is an enhancement of the standard Emacs editor.
  • For Windows users, XEmacs is a good solution.


  • Once the installation of Emacs and ESS is done, you just have to open Emacs and open or create a file with extension .R (C-x C-f). ESS will be automatically loaded.
    • C-c M-j evaluates the current line
    • C-c M-r evaluates the current region
    • C-c M-b evaluates the current buffer

WinEdt

  • How to use R for Windows with the RWinEdt extension ? by Andy Eggers[17]
  • WinEdt is not open source
  • WinEdt is for Windows only.
  • Install the RWinEdt package.

gedit with gedit-r-plugin

  • For Linux users only.
  • There is also a plugin for gedit called gedit-r-plugin. This can be installed using Synaptic or any other package manager on a linux platform.

Customizing R

R profile

R can be customized using the Rprofile file. On Linux, this file is stored in the home directory. You can edit it by running the following command in a terminal :

$ gedit ~/.Rprofile

If you use some packages very often, you can load them systematically using the Rprofile file. You can also change the default options.
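As a rough sketch, an Rprofile file is simply R code that is run at startup. The settings below are illustrative choices, not recommendations; the mirror URL and the startup message are assumptions made for this example.

```r
# Example contents of a ~/.Rprofile file (illustrative settings only)
options(digits = 4)                                       # print fewer significant digits
options(repos = c(CRAN = "https://cloud.r-project.org"))  # default CRAN mirror

# .First() is run at the start of every session, if it is defined
.First <- function() {
  message("Session started on ", format(Sys.Date()))
}
```

Packages that you use very often can be loaded in the same file with library() calls.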

Options

The function options() without any argument shows all options.

> options()
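A single option can be read with getOption() and changed with options(). Since options() returns the previous values, a setting can be changed temporarily and restored afterwards, as in this minimal sketch:

```r
getOption("digits")          # current number of significant digits (7 by default)

old <- options(digits = 3)   # change it; the previous value is returned
print(pi)                    # prints 3.14
options(old)                 # restore the previous setting
print(pi)
```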

The linguistic and encoding options can be modified using Sys.setlocale() :

> Sys.setlocale()
[1] "fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/en_US.UTF-8"

By default, error messages are in the local language. However, it is possible to set them in English using Sys.setenv().

Sys.setenv(LANGUAGE='en')

References

  1. This section was imported from the Wikiversity project Installation, How to use R course
  2. Updating R from R (on Windows) – using the {installr} package
  3. Portable R http://sourceforge.net/projects/rportable/
  4. http://jgr.markushelbig.org/JGR.html
  5. http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/
  6. rstudio.org
  7. John Verzani "Getting Started with RStudio An Integrated Development Environment for R", O'Reilly Media, September 2011
  8. Jeffrey Racine, (forthcoming), "RStudio: A Platform Independent IDE for R and Sweave," Journal of Applied Econometrics.
  9. StatET : http://www.walware.de/goto/statet
  10. Rattle : http://rattle.togaware.com/
  11. Graham J Williams. Rattle: A Data Mining GUI for R. The R Journal, 1(2):45-55, December 2009
  12. Tinn stands for Tinn Is Not Notepad http://www.sciviews.org/Tinn-R/
  13. Note that Notepad++ can be installed on a USB stick http://sourceforge.net/projects/notepadpluspe/
  14. NPPtoR is also a portable software http://sourceforge.net/projects/npptor/
  15. ESS : http://ess.r-project.org/
  16. Vincent Goulet Emacs page http://vgoulet.act.ulaval.ca/emacs
  17. http://www.people.fas.harvard.edu/~aeggers/RWinEdt_installation.pdf


Documentation

Obtaining Help

For each package you have a reference manual available as an HTML file from within R or as a PDF on the CRAN website. You also often have Vignettes or comprehensive articles in the R Journal, the Journal of Statistical Software, etc.

library(help="package_name")
vignette("np",package="np")
vignette(all=FALSE) # vignettes for all attached packages
vignette(all=TRUE) # vignettes for all packages on the computer

You can search for help inside all loaded packages using help() or ?. Usually you do not need to add quotes to function names, but sometimes it can be useful. args() gives the full syntax of a function.

help(lm)
?lm
?"for"
?"[["
args("lm")
function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...) 
# NULL

apropos() and find() look for all the functions in the loaded packages containing a keyword or a regular expression[1].

apropos("norm")
#   [1] "dlnorm"         "dnorm"          "plnorm"        
#   [4] "pnorm"          "qlnorm"         "qnorm"         
#   [7] "qqnorm"         "qqnorm.default" "rlnorm"        
#  [10] "rnorm"          "normalizePath"
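Because the keyword is treated as a regular expression, the search can be made more precise. A short sketch, assuming the standard packages (including stats) are attached as in a default session:

```r
q_norm <- apropos("^q.*norm$")        # names starting with "q" and ending with "norm"
q_norm
apropos("norm", mode = "function")    # restrict the search to functions
find("qnorm")                         # which attached package provides qnorm()
```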

You can search for help in all installed packages using help.search() or its shortcut ??.

??"lm"
help.search("covariance")

RSiteSearch() looks for help in all packages and in the R mailing lists. The sos package improves the RSiteSearch() function with the findFn() function. ??? is a wrapper for findFn().

RSiteSearch("spline")
library("sos")
findFn("spline", maxPages = 2)
???"spline"(2)

hints() in the hints package suggests what to do with an object.

fit <- lm(y ~ x)
library("hints")
hints(fit) # returns a list of functions using lm objects.

References

  1. If you want to know more about regular expressions, have a look at the Regular expressions section in the Text Processing page.
  2. Introduction to Data Analysis


Control Structures

Conditional execution

  • Help for programming :
> ?Control

if accepts a unidimensional condition.

> if (condition){
+     statement  
+     } else {
+     alternative
+     }

The unidimensional condition may be one of TRUE or FALSE, T or F, 1 or 0 or a statement using the truth operators:

  • x == y "x is equal to y"
  • x != y "x is not equal to y"
  • x > y "x is greater than y"
  • x < y "x is less than y"
  • x <= y "x is less than or equal to y"
  • x >= y "x is greater than or equal to y"

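Conditions can be combined using & or && for AND, and | or || for OR. The single forms (& and |) work element by element on vectors, whereas the double forms (&& and ||) return a single value and are the ones normally used inside if.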

> if(TRUE){
+     print("This is true")
+     }
  [1] "This is true"
> x <- 2  # x gets the value 2
> if(x==3){
+     print("This is true")
+     } else {
+     print("This is false")
+     }
 [1] "This is false"
> y <- 4 # y gets the value 4
> if(x==2 && y>2){
+     print("x equals 2 and y is greater than 2")
+     }
 [1] "x equals 2 and y is greater than 2"
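The difference between the element-wise and the single-value operators is easiest to see on a small example. In this sketch, x and y are arbitrary objects chosen for illustration.

```r
x <- c(1, 5, 10)

x > 2 & x < 8        # element-wise: FALSE TRUE FALSE
any(x > 2 & x < 8)   # collapse to a single TRUE/FALSE before using it in if()

y <- NULL
# && stops as soon as the result is known, so !is.null(y) guards y[1]
if (!is.null(y) && y[1] > 0) {
  print("y starts with a positive number")
} else {
  print("y is empty or does not start with a positive number")
}
```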

The ifelse() function takes the condition as its first argument, the value to return where the condition is true as its second argument, and the value to return where it is false as its third argument. Unlike if, the condition can be a vector. For instance, we generate a sequence from 1 to 10 and keep the values which are lower than 5 or greater than 8, replacing the others with 0.

> x <- 1:10 
> ifelse(x<5 | x>8, x, 0)
 [1]  1  2  3  4  0  0  0  0  9 10

Sets

R has some very useful handlers for sets to select a subset of a vector:

> x = runif(10)
> x<.5
 [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
> x
 [1] 0.32664759 0.57826623 0.98171138 0.01718607 0.24564238 0.62190808 0.74839301 
 [8] 0.32957783 0.19302650 0.06013694
> x[x<.5]
[1] 0.32664759 0.01718607 0.24564238 0.32957783 0.19302650 0.06013694

To exclude a subset of a vector:

> x = 1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> x[-1:-5]
[1]  6  7  8  9 10
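A few other base R tools are handy when selecting or excluding elements. The following sketch uses a short vector so that the results are easy to check by hand.

```r
x <- c(3, 8, 1, 9, 4)

which(x > 3)          # positions of the elements greater than 3: 2 4 5
x[x %in% c(1, 9)]     # elements that belong to a given set: 1 9
x[-c(1, 2)]           # drop the first two elements: 1 9 4
setdiff(x, c(3, 4))   # values of x that are neither 3 nor 4: 8 1 9
```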

Loops

Implicit loops

[Figure: Example of fast code using vectorisation.]

R has support for implicit loops, which is called vectorization. This is built in to many functions and standard operators. For example, the + operator can add two arrays of numbers without the need for an explicit loop.

Explicit loops are generally slow, and it is better to avoid them when it is possible.
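As a small sketch of the idea, the sum of squares of a vector can be computed either with a loop written by hand or with vectorized operators. Both give the same result; the vectorized form is shorter and, on long vectors, usually much faster.

```r
x <- c(1.5, 2, 3.5, 4)

# Loop written by hand: accumulate the squares one element at a time
total <- 0
for (value in x) {
  total <- total + value^2
}

# Vectorized version: ^ and sum() operate on the whole vector at once
total_vec <- sum(x^2)

all.equal(total, total_vec)   # both approaches give the same result
```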

  • apply() can apply a function to elements of a matrix or an array. This may be the rows of a matrix (1) or the columns (2).
  • lapply() applies a function to each column of a dataframe and returns a list.
  • sapply() is similar but the output is simplified. It may be a vector or a matrix depending on the function.
  • tapply() applies the function for each level of a factor.
> N <- 10
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1 + 1
> male <- rbinom(N,1,.48)
> y <- 1 + x1 + x2 + male + rnorm(N)
> mydat <- data.frame(y,x1,x2,male)
> lapply(mydat,mean) # returns a list
$y
[1] 3.247

$x1
[1] 0.1415

$x2
[1] 1.29

$male
[1] 0.5

> sapply(mydat,mean) # returns a vector
     y     x1     x2   male 
3.2468 0.1415 1.2900 0.5000 
> apply(mydat,1,mean) # applies the function to each row
 [1]  1.1654  2.8347 -0.9728  0.6512 -0.0696  3.9206 -0.2492  3.1060  2.0478  0.5116
> apply(mydat,2,mean) # applies the function to each column
     y     x1     x2   male 
3.2468 0.1415 1.2900 0.5000 
> tapply(mydat$y,mydat$male,mean) # applies the function to each level of the factor
    0     1 
1.040 5.454
  • See also aggregate() which is similar to tapply() but is applied to a dataframe instead of a vector.
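To make the comparison between tapply() and aggregate() concrete, here is a minimal sketch on a small made-up dataset (the values are chosen only so that the group means are easy to verify).

```r
# Small made-up dataset, for illustration only
dat <- data.frame(
  male = c(0, 0, 1, 1, 1),
  y    = c(1.0, 3.0, 4.0, 5.0, 6.0),
  x1   = c(0.5, 1.5, 2.0, 2.0, 5.0)
)

tapply(dat$y, dat$male, mean)   # mean of y for each value of male

# Same idea with aggregate(), applied to several columns at once
res <- aggregate(cbind(y, x1) ~ male, data = dat, FUN = mean)
res
```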

Explicit loops

R provides three ways to write loops: for, repeat and while. The for statement is very simple: you define an index (here k) and a vector (in the example below the vector is 1:5), and you specify the action you want between braces.

> for (k in 1:5){
+ print(k)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

When it is not possible to use the for statement, you can also use repeat or while by specifying a stopping rule. Be careful with this kind of loop: if the stopping rule is misspecified, the loop will never end. In the two examples below, values are drawn from the standard normal distribution until a draw exceeds 1. The cat() function is used to display the current value on screen.

> repeat { 
+ 	g <- rnorm(1) 
+ 	if (g > 1.0) break 
+ 	cat(g,"\n")
+ 	} 
-1.214395 
0.6393124 
0.05505484 
-1.217408 
> g <- 0
> while (g < 1){
+ 	g <- rnorm(1) 
+ 	cat(g,"\n")
+ 	}
-0.08111594 
0.1732847 
-0.2428368 
0.3359238 
-0.2080000 
0.05458533 
0.2627001 
1.009195
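
One way to guard against a loop that never ends is to add an iteration limit. The sketch below is only an illustration: max_iter is a hypothetical safety cap and the seed is arbitrary.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
max_iter <- 1000     # hypothetical safety limit
n_iter <- 0

repeat {
  n_iter <- n_iter + 1
  g <- rnorm(1)
  # stop when the draw exceeds 1 or when the cap is reached
  if (g > 1 || n_iter >= max_iter) break
}
n_iter
```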

The next statement can be used to discontinue one particular cycle and skip to the “next”.

> for (k in 1:10) { 
+   if(k==8) {
+     print("skipped")
+     next
+   }
+   print(k)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] "skipped"
[1] 9
[1] 10

Iterators


Working with functions

Looking at the code of a function

  • You can type the name of the function in the console without any round brackets after the name. This will print the code of the function in the console.
  • You can also use the page() function which opens a new editor window and prints the code of the function in this editor.
  • You can also use the trCopy() function in the TinnR package to copy the code of the function. Then you just have to paste it in a text editor to have a look at it.

Here is an example with the lm() function.

> lm
> page(lm)
> library(TinnR)
> trCopy(lm)

Creating your own function

A simple function without argument which doesn't return anything

> fn <- function(){
+ print("hello")
+ }
> fn()
[1] "hello"


Returning an object

By default the value of the last line (*) is returned. In the following example, we have a simple function with two objects. The last one is returned.

> test <- function() {
+ x <-1
+ z <- 2
+ }
> res <- test()
> res
[1] 2

The function can return an object explicitly using return() (but as it is the last line, you could simply use x instead):

> test <- function() {
+ x <- 1
+ z <- 2
+ return(x)
+ }
> res <- test()
> res
[1] 1
(*) More precisely, it is not the "last line" but rather the value of the last evaluation which is returned from the function.

Adding arguments

It is possible to add arguments.

square <- function(x){
	x2 <- x^2
	return(x2)
	}
square(x = 2)

Note that the above function could be written more concisely as

square <- function(x) x^2

(as the last value is returned)

The ... argument means that you can add other arguments which will be passed to functions inside the function.

plot2 <- function(x,...){
	plot(x, type = "l", ...)
	}
plot2(runif(100), main = "line plot", col = "red")
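
The same mechanism works for non-graphical functions. In this sketch, my_mean is a hypothetical wrapper that forwards any extra arguments to mean():

```r
# Hypothetical wrapper: everything after x is forwarded to mean()
my_mean <- function(x, ...) {
  mean(x, ...)
}

v <- c(1, 2, 3, NA)
my_mean(v)                # NA, because of the missing value
my_mean(v, na.rm = TRUE)  # na.rm is passed through ... to mean()
```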

It is possible to add a dataframe as argument[1]. Here is an example :

redplot <- function(z, y, data, env=parent.frame()) {
       if(!missing(data)){
			z <- data[,deparse(substitute(z))]
			y <- data[,deparse(substitute(y))]
			}
	plot(z,y, col = "red", pch = 15)
} 

mydat <- data.frame(vm = rnorm(10),output = rnorm(10))
redplot(vm,output,data=mydat)

For estimation commands it is possible to add formulas as arguments. For instance, we can create our own function for ordinary least squares using a formula interface.

ols <- function(formula, data = list()) {
	mf <- model.frame(formula=formula, data=data)
	X <- model.matrix(attr(mf, "terms"), data=mf)
	y <- model.response(mf)
	beta <- solve(t(X)%*%X)%*%t(X)%*%y
	se <- sqrt( 1/(nrow(X) - ncol(X)) * sum((y - X%*%beta)^2) * diag(solve(t(X)%*%X)))
	res <- cbind(beta,se)
	colnames(res) <- c("Coefficients","Standard errors")
	res
}
N <- 100
u <- rnorm(N)
x <- rnorm(N) + 1
y <- 1 + x + u
ols(y~x)

Recursive functions

R supports recursive functions. The function below recursively computes the first n Fibonacci numbers.

> fib <- function(n) {
              if(n > 2) {
                   m <- fib(n-1)
                   c(m, sum(tail(m, 2)))
                   }
              else rep(1, n)
              }
> fib(30)
 [1]      1      1      2      3      5      8     13     21     34     55
[11]     89    144    233    377    610    987   1597   2584   4181   6765
[21]  10946  17711  28657  46368  75025 121393 196418 317811 514229 832040

Functions as Objects

R functions can be treated as objects

> a <- function(n) function(a) runif(a)
> b <- a(1)
> b(10)
 [1] 0.8726873 0.9512367 0.5971435 0.5540743 0.6378967 0.4030071 0.2750673 0.1777123 0.6960378 0.3969920

This can be useful when you want to generate many related functions:

> a <- list()
> b <- function(i){ i; function() runif(i)}
> for (i in 1:10) a[[i]] <- b(i)
> a[[1]]()
[1] 0.2617396
> a[[2]]()
[1] 0.8822248 0.3374574
> a[[3]]()
[1] 0.0348156 0.4212788 0.6107646
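
A function that returns functions is often called a function factory. The sketch below restates the example with lapply() and uses force() to evaluate the argument immediately, which plays the same role as the bare i; statement. The names make_sampler and samplers are illustrative.

```r
make_sampler <- function(n) {
  force(n)                 # evaluate n now rather than lazily at first call
  function() runif(n)
}

samplers <- lapply(1:3, make_sampler)

# Each generated function draws a different number of values
draw_sizes <- lengths(lapply(samplers, function(f) f()))
draw_sizes
```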

Higher-order functions

You can use higher-order functions in R. Contrary to common belief, using them instead of loops is not necessarily faster, because the apply function has a for-loop inside its definition. Use them primarily to improve the clarity of your code.[2]

apply

apply is the most basic of R's map functions; it works on matrices and arrays. lapply, sapply and mapply are related functions that work on lists, vectors and multiple vectors respectively.

apply takes as arguments an array, a vector giving the dimension(s) to map along (1 for rows, 2 for columns) and a function. The following example is based on the apply documentation. It uses apply to compute column and row sums of a matrix.

x <- matrix(round(rnorm(100)),10,10)
col.sums <- apply(x, 2, sum)
row.sums <- apply(x, 1, sum)
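
On a small deterministic matrix the result is easy to check by hand. This sketch also compares against the built-in colSums(); the matrix m is illustrative.

```r
m <- matrix(1:6, nrow = 2)     # 2 rows, 3 columns, filled by column

row_tot <- apply(m, 1, sum)    # one value per row: 9 12
col_tot <- apply(m, 2, sum)    # one value per column: 3 7 11

all.equal(col_tot, colSums(m)) # same values as the dedicated function
```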


tapply

tapply is similar to apply, but applies a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.

> x1 <- rnorm(10)
> x2 <- sample(1:2, 10, replace = T)
> cbind(x1,x2)
              x1 x2
 [1,] -1.7905021  1
 [2,]  1.2908169  2
 [3,] -2.1902513  2
 [4,]  0.4845488  1
 [5,]  0.2281593  1
 [6,]  0.2201302  1
 [7,]  2.1574243  1
 [8,]  0.5789705  2
 [9,]  1.3315188  1
[10,] -1.0029822  2
> tapply(x1, x2, sum)
        1         2 
 2.631279 -1.323446

Reduce

This function from the Reduce documentation computes cumulative sums:

> cadd <- function(x) Reduce("+", x, accumulate = TRUE)
> cadd(1:10)
 [1]  1  3  6 10 15 21 28 36 45 55
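
Reduce() folds a binary function over a vector from left to right. A few small cases, chosen here only for illustration, show the pattern:

```r
total   <- Reduce(`+`, 1:4)                        # ((1 + 2) + 3) + 4
product <- Reduce(function(a, b) a * b, 1:5)       # 1 * 2 * 3 * 4 * 5
running <- Reduce(`+`, 1:4, accumulate = TRUE)     # keeps intermediate results
```

With accumulate = TRUE and the + operator, the result matches cumsum().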

References


Debugging

Some basic tips

  • Use print() statements in your functions to print variable values. Although this technique is considered low-tech or even old fashioned by some, it can still be a quick and easy way to trace an error.
  • Place a browser() statement in the function just before the crashing line. When the function is called, it will be executed up to the browser() line. The command-line interface then switches to the function environment, so that all variables in the function can be inspected or changed. See below for commands available in browser() mode.

Tracing errors with traceback()

A standard error message in R will tell you which function threw the error. Consider as an example the following function whose sole purpose is to throw an error.

myFun <- function(){
    stop("Woops! An error")
}

A call to myFun() gives

> myFun()
Error in myFun() : Woops! An error

After an error is raised, the traceback() function allows you to show the call stack leading to the error. For example, the function below calls myFun.

myFun2 <- function(){
    myFun()   
}

Calling myFun2() and traceback() gives

> myFun2()
Error in myFun() : Woops! An error
> traceback()
3: stop("Woops! An error")
2: myFun()
1: myFun2()

The traceback() function can be executed automatically each time an error is raised with the option

options(error=traceback)

It may be switched off again with

options(error=NULL)

Executing code line by line

A function can be executed by setting it to debugging mode with

debug(FUNCTION_NAME)

Then, when the function is called, a browser is opened in that function's environment so that it can be executed line by line. In the debugging browser, apart from all standard R functionality, the following commands are available.

Command   Meaning
n         Advance to the next step. An empty line also works.
c, cont   Continue to the end of the current context, e.g. to the end of the current loop or to the end of the function.
where     Print the stack of function calls (where are you?).
Q         Exit the browser and return to the top-level R prompt.

Debugging can be switched off with

undebug(FUNCTION_NAME)

There are a few related functions as well:

  • debugonce() switches on debugging for the next call only; it is switched off automatically afterwards.
  • isdebugged() checks whether a function is in debugging mode.
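
The debugging flag can be inspected without calling the function. In this minimal sketch, f is a throwaway function used only for illustration:

```r
f <- function(x) x + 1

debug(f)
flag_on <- isdebugged(f)    # TRUE while f is flagged for debugging

undebug(f)
flag_off <- isdebugged(f)   # FALSE once the flag is removed

c(flag_on, flag_off)
```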

Browsing the call stack

This is the most advanced debugging option in base R. By setting options(error=recover) you get the opportunity to browse any environment in the call stack. For example,

> options(error=recover)
> myFun2()
Error in myFun() : Woops! An error

Enter a frame number, or 0 to exit   

1: myFun2()
2: myFun()

Selection:

By typing '1' or '2' after Selection: the browser will jump to the selected environment. Once in the browser, all standard R functionality is at your disposal, as well as the commands in the table below.

Command   Meaning
c, cont   Exit the browser and continue at the next statement. An empty line will do the same.
n         Enter the step-through debugger (this changes the meaning of c).
where     Print a stack trace of active function calls (where are you in the stack?).
Q         Exit the browser; do not continue at the next statement but go back to the top-level R prompt.

Recovery mode can be switched off by

options(error=NULL)


Using C or Fortran

For some tasks, R can be slow. In that case, it is possible to write a program in C or Fortran and to use it from R. This page is for advanced programmers only.



Utilities

This page includes material about some utilities. Most of the functions presented here have nothing to do with statistical analysis but may be useful when working on a project. Many of them resemble standard Unix commands.

System (Unix/DOS)

system() gives access to the operating system shell (DOS or Unix). The option wait=FALSE tells R not to wait for the task to finish.

Some examples :

  • You can convert an image from PS to PNG using the convert command (part of ImageMagick) on your computer. If you want to know more about this command, open a Terminal application and type man convert (this should work on Mac OS and Linux).
  • You can open Stata and run a program.
  • You can run pdflatex from R and directly open the pdf in a pdf browser.
system("convert W:/toto.ps W:/toto.png") # converts toto.ps to toto.png
system("D:/Stata10/stata.exe do D:/pgm.do", wait = F) # opens Stata and runs pgm.do
system("pdflatex.exe -shell-escape file.tex") # runs pdflatex
system("open file.pdf") # opens the pdf
system("open M:/.../doc/*.pdf") # opens all the pdf in a directory

See also sys() in the Hmisc package, shell() and shell.exec().

File Handling

dir() lists all the files in a directory. It is similar to the Unix function ls. dir.create() creates a new directory. It is similar to mkdir in Unix.

file.info() gives information about a file.

> file.info("taille.txt")
           size isdir mode               mtime               ctime               atime exe
taille.txt  444 FALSE  666 2009-06-26 12:25:44 2009-06-26 12:25:43 2009-06-26 12:25:43  no

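Removing files with a specific pattern (pattern is a regular expression, and full.names = TRUE returns paths that file.remove() can use):

file.remove(dir(path = "directoryname", pattern = "\\.log$", full.names = TRUE))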
  • file.edit() opens a file in the text editor.
  • file.show() opens a file in a new window.
  • tempfile() creates a temporary file.
  • getZip() in the Hmisc package.
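
The following sketch exercises several of these functions inside a scratch directory, so nothing outside R's temporary folder is touched. The file names are made up for illustration.

```r
tmp_dir <- tempfile("demo_logs_")     # unique path inside tempdir()
dir.create(tmp_dir)
file.create(file.path(tmp_dir, c("a.log", "b.log", "keep.txt")))

# List only the .log files, with full paths so they can be removed
logs <- dir(tmp_dir, pattern = "\\.log$", full.names = TRUE)
file.remove(logs)

remaining <- dir(tmp_dir)             # only keep.txt is left
remaining
```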

Internet

browseURL() opens a URL in an internet browser. download.file() downloads a file from the internet.

> browseURL("http://en.wikibooks.org/wiki/R_Programming")

To see the default browser, use getOption()

getOption("browser")

We can change the default browser using the options() command. It is safer to save the current options first.

oldoptions <- options() # save the options
options(browser = "D:/FramafoxPortable/FramafoxPortable.exe")

You can download a file from the internet using download.file(). Note that very often you don't need to download a file from the internet and you can directly load it into R from the internet using standard functions. For instance, if you want to read a text file from the internet, you can use read.table(), scan() or readLines().

# For example, we download "http://en.wikibooks.org/wiki/R_Programming/Text_Processing" on our Desktop
download.file(url="http://en.wikibooks.org/wiki/R_Programming/Text_Processing",destfile= "~/Desktop/test_processing.html")
# You can also read it into R using readLines()
text <- readLines("http://en.wikibooks.org/wiki/R_Programming/Text_Processing")

See also RCurl

Computing time

If you perform computationally intensive tasks you may want to optimize the computing time. Two functions are available: system.time() and proc.time(). Both return a vector of values. The first value is the user CPU time, the second the system CPU time and the third the elapsed (wall-clock) time.

> system.time(x<-rnorm(10^6))
[1] 1.14 0.07 1.83 0.00 0.00
> debut <- proc.time()
> x <- rnorm(10^6)
> proc.time()-debut
[1]  1.66  0.10 10.32  0.00  0.00
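
The components of the result can be extracted by name, which is less error-prone than relying on their position. A minimal sketch (the timed expression is arbitrary):

```r
tm <- system.time(for (i in 1:1e5) sqrt(i))

names(tm)                       # includes "user.self", "sys.self", "elapsed"
elapsed_sec <- tm[["elapsed"]]  # wall-clock seconds
elapsed_sec
```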

Computing process

user.prompt() (Zelig) pauses the computation (useful if you want to do a demo). waitReturn() (cwhmisc) does the same job. Sys.sleep() suspends the computation for a given number of seconds.

> user.prompt()

Press <return> to continue: 
> Sys.sleep(5)

It is possible to stop the computing process if a logical condition is not true using stopifnot().
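
As an illustration, stopifnot() is often used to validate function arguments. Here safe_sqrt is a hypothetical helper, and tryCatch() is used only so that the example keeps running after the error.

```r
safe_sqrt <- function(x) {
  # every condition must be TRUE, otherwise an error is raised
  stopifnot(is.numeric(x), all(x >= 0))
  sqrt(x)
}

ok  <- safe_sqrt(c(4, 9))
bad <- tryCatch(safe_sqrt(-1), error = function(e) "stopped")
```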

Miscellaneous

  • trCopy() (TinnR package) copies an object to the clipboard. It is useful for large objects, for instance if you want to copy the code of a function and paste it in a text editor.
> trCopy(lm)
[1] TRUE
  • sessionInfo() gives information on the current session (R version + loaded packages). This function may be useful for reproducible computing. getRversion() gives the current R version. R.version gives more details about the computer and the R build, and R.Version() returns the same information as a list.
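
A short sketch of how these version helpers might be used in a script. The version threshold below is an arbitrary example, not a requirement of any package.

```r
# getRversion() can be compared directly with a version string
recent_enough <- getRversion() >= "3.5.0"

major   <- R.version$major     # major version as a character string
details <- R.Version()         # the same details as a plain list

recent_enough
```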

See Also

  • See the R.utils package[1]

References

  1. Henrik Bengtsson (2009). R.utils: Various programming utilities. R package version 1.1.7. http://CRAN.R-project.org/package=R.utils


Estimation utilities

This page deals with methods which are available for most estimation commands. They can be useful for all kinds of regression models.

Formulas

Most estimation commands use a formula interface. The outcome is left of the ~ and the covariates are on the right.

y ~ x1 + x2

It is easy to include a categorical variable as a predictor in a model. If the variable is not already a factor, one just needs to use the as.factor() function. This will create a set of dummy variables.

y ~ as.factor(x)

For instance, we can use the Star data in the Ecdat package :

library("Ecdat")
data(Star)
summary(lm(tmathssk ~ as.factor(classk), data = Star))
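
To see the dummy variables that a factor generates, you can inspect the design matrix with model.matrix(). The three-level variable below is made up for illustration.

```r
d <- data.frame(x = c("a", "b", "c", "a"))

mm <- model.matrix(~ as.factor(x), data = d)
colnames(mm)   # intercept plus one dummy for each level except the first
```

The first level ("a" here) serves as the reference category, so it has no column of its own.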

I() takes arguments "as is". For instance, if you want to include in your equation a transformed variable such as a squared term or the sum of two variables, you may use I().

lm(y ~ x1 + I(x1^2) + x2)
lm(y ~ I(x1 + x2))
lm(I(y-100) ~ I(x1-100) + I(x2 - 100))

It is easy to include interactions between variables by using : or *. : adds only the interaction term, whereas * adds both the interaction term and the individual terms.

lm(y~x1:x2) # interaction term only
lm(y~x1*x2) # interaction and individual terms
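
The difference between the two operators is visible in the columns of the design matrix. A minimal sketch with made-up data:

```r
d <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(2, 1, 4, 3))

only_inter <- colnames(model.matrix(~ x1:x2, data = d))
full_terms <- colnames(model.matrix(~ x1*x2, data = d))

only_inter   # "(Intercept)" "x1:x2"
full_terms   # "(Intercept)" "x1" "x2" "x1:x2"
```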

It is also possible to generate polynomials using the poly() function with option raw = TRUE.

lm(y ~ poly(x, degree = 3, raw = TRUE))
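
With raw = TRUE, poly() produces the plain powers of x, so the fit should match one written with I(). The sketch below checks this on simulated data; the seed and the coefficients are arbitrary.

```r
set.seed(42)
x <- rnorm(50)
y <- 1 + x - 0.5 * x^2 + 0.2 * x^3 + rnorm(50)

fit_poly <- lm(y ~ poly(x, degree = 3, raw = TRUE))
fit_I    <- lm(y ~ x + I(x^2) + I(x^3))

# Same design matrix, so the same coefficients (only the names differ)
same_coef <- all.equal(unname(coef(fit_poly)), unname(coef(fit_I)))
same_coef
```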

There is also an advanced formula interface which is useful for instrumental variables models and mixed models. For instance, ivreg() (AER) uses this advanced formula interface. The instrumental variables are entered after the |. See the Instrumental Variables section if you want to learn more.

library("AER")
ivreg(y ~ x | z)

Output

In addition to the summary() and print() functions, which display the output for most estimation commands, some authors have developed simplified output functions. One of them is the display() function in the arm package. Another is coefplot(), also in the arm package, which displays the coefficients with confidence intervals in a plot. Following the standards defined by Nathaniel Beck[1], Jeff Gill developed graph.summary()[2]. This command omits auxiliary statistics that are rarely useful.


> source("http://artsci.wustl.edu/~jgill/Models/graph.summary.R")
> N <- 1000
> u <- rnorm(N)
> x1 <- 1 + rnorm(N)
> x2 <- 1 + rnorm(N) + x1
> y <- 1 + x1 + x2 + u
> graph.summary(lm(y ~ x1 + x2))
Family: gaussian
Link function: identity

             Coef Std.Err. 0.95 Lower 0.95 Upper CIs:ZE+RO
(Intercept) 0.980    0.056      0.871      1.089      |o| 
x1          1.040    0.043      0.955      1.125      |o| 
x2          0.984    0.031      0.923      1.045      |o| 

N: 1000    Estimate of Sigma: 0.998

> library("arm")
> display(lm(y ~ x1 + x2))
lm(formula = y ~ x1 + x2)
            coef.est coef.se
(Intercept) 0.89     0.05   
x1          1.05     0.04   
x2          1.02     0.03   
---
n = 1000, k = 3
residual sd = 0.96, R-Squared = 0.86

Weights

Tests

Confidence intervals

Delta Method

  • If you want to know the standard error of a transformation of one of your parameters, you need to use the delta method. Several packages implement it:
  • deltamethod() in the msm package[3].
  • delta.method() in the alr3 package.
  • deltaMethod() in the car package.
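
For a single parameter the delta method can also be sketched by hand in base R. The example below is an illustration under simple assumptions (simulated data, the transformation g(b) = exp(b)) and is not the implementation used by the packages above.

```r
set.seed(1)                       # arbitrary seed
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200)
fit <- lm(y ~ x)

b  <- coef(fit)[["x"]]            # estimated slope
vb <- vcov(fit)["x", "x"]         # its estimated variance

# First-order approximation: Var(g(b)) is about g'(b)^2 * Var(b),
# and for g(b) = exp(b) the derivative g'(b) is exp(b)
se_exp_b <- sqrt(exp(b)^2 * vb)
se_exp_b
```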

Zelig : the pseudo-bootstrap method

Zelig[4] is a postestimation package which simulates from the distribution of the estimated parameters and computes quantities of interest such as marginal effects or predicted probabilities. This is especially useful for non-linear models. Zelig comes with a set of vignettes which explain how to deal with each kind of model. There are three main commands.

  • zelig() estimates the model and draws from the distribution of estimated parameters.
  • setx() fixes the values of explanatory variables.
  • sim() computes the quantities of interest.

References

  1. Nathaniel Beck "Making regression and related output more helpful to users" The Political Methodologist 2010 http://politics.as.nyu.edu/docs/IO/2576/beck_tpm_edited.pdf
  2. Jeff Gill graph.summary() http://artsci.wustl.edu/~jgill/Models/graph.summary.s
  3. See the example on the UCLA Statistics webpage : http://www.ats.ucla.edu/stat/r/faq/deltamethod.htm
  4. Kosuke Imai, Gary King and Olivia Lau (2009). Zelig: Everyone's Statistical Software. R package version 3.4-5. http://CRAN.R-project.org/package=Zelig


Packages

An R package includes a set of functions and datasets. Packages are often developed as supplementary material to books. For instance the MASS package was developed by Venables and Ripley for their book Modern Applied Statistics with S and the car package was developed by John Fox for his book An R and S-Plus Companion to Applied Regression.

Load a package

A package is loaded into the current R environment using the library() function. A list of functions and datasets included in a package can be obtained by using the h or help argument of the library function.

library("stats4") # loads the package "stats4"
library(help = "stats4") # lists the functions and datasets documented in the package
data(package="stats4") # gives the list of all available datasets

A package can be detached from the current environment by using the detach() function:

> detach("package:prettyR")

Without any arguments, the library() function lists all of the packages currently available to the user. env() (gdata) describes all loaded environments (i.e. packages). search() gives the list of all attached packages.

> library() # returns the description of all the packages available on the computer
> dir(.libPaths()) # returns the name of all the packages available on the computer (quicker than the previous one)
> search()
> env(unit="MB")

current.packages() (Zelig) shows all the required and suggested packages.

> current.packages("sem")

Where are my packages stored?

  • The .libPaths() function without arguments prints the library directories
  • The .libPaths() function with a directory as argument adds that directory to the library search path, so that new packages can be installed there.
> .libPaths()
[1] "/Users/username/Library/R/library"
[2] "/Library/Frameworks/R.framework/Resources/library"
> .libPaths("W:/AppData/R/library")

Install new packages

  • Each major distribution of R includes a 'base' set of packages which support many basic statistical functions.
  • Many R users also choose to install additional 'add-on' packages to provide simplified interfaces to R commands or to add specialist functionality, e.g. the ggplot2 Grammar of Graphics package provides advanced graphical output.
  • The exhaustive list of all available packages is on the CRAN website.
  • The R community has developed a vast resource of add-on packages, some with unique functionality, some with overlapping functionality. It is therefore common to find multiple R packages capable of completing the same task, e.g. reading and writing Excel spreadsheets. Ultimately which package to use is your choice.
  • To install a new package, it is usually sufficient to give the name of the package as an argument of the install.packages() function.
  • Sometimes you need to specify more options. For instance, this is the case if you are not an administrator of your computer.
    • "lib" specifies the directory where you want to store the package.
    • "repos" specifies a list of repositories. Note that you can specify a vector of repositories.
    • "dependencies = TRUE" (abbreviated to dep=TRUE in the example below) specifies that all the required packages are also downloaded and installed.
> install.packages("faraway")
> install.packages("rgrs", lib="W:/AppData/R/library" , 
repos=c("http://r-forge.r-project.org","http://cran.fr.r-project.org/"), 
dep=TRUE)
  • Stay up to date.

If you want to be aware of the latest packages, type new.packages() in R or visit the Revolution Computing Blog, which publishes a monthly list of new and updated packages.

> new.packages() # lists the packages available in the repositories that are not yet installed
> update.packages() # updates all the packages installed with the newest version available in the repositories

We can also install bundles of packages using install.views() or update.views() (ctv).

> install.packages("ctv")
> library("ctv")
> install.views("Econometrics")
> update.views("Econometrics")

We can also remove packages with remove.packages().
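
Before installing, it can be convenient to check whether a package is already present. The sketch below uses requireNamespace(); is_installed is a hypothetical helper name, and the install line is left commented out so that nothing is downloaded when the example is run.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

has_stats <- is_installed("stats")   # stats ships with every R installation
has_stats

# Typical use, left as a comment on purpose:
# if (!is_installed("faraway")) install.packages("faraway")
```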

Package Documentation and Help

All R packages install with 'help' documentation, listing their functions and providing syntax and usage examples.

> library("tidyr") # load the tidyr package
> help("tidyr")    # view the tidyr package's help documentation

See the Obtaining Help Documentation section for more details on accessing package 'help' documentation.

Package Dependencies

  • Most R packages have dependencies or references to other R packages. You must have all of an R package's 'required' dependencies installed, before you can use the package.
  • R package dependencies come in two types, required and suggested.
  • Specialist R packages such as the ggplot2 Grammar of Graphics package have large package dependency trees.
  • The install.packages() function will automatically download and install a package and its dependencies, on a computer with an Internet connection.
  • The R CMD INSTALL utility will check preinstalled packages for dependencies, but not download missing packages.
  • Users must follow separate package download and installation processes when working on a computer with no Internet connection. The miniCRAN package can be used to assist in the offline management of R package dependencies.

Building R Packages

You can write your own R packages. However, all packages submitted to CRAN (or Bioconductor) must follow specific guidelines, including the folder structure of the package and required files such as DESCRIPTION and NAMESPACE.


  • See Friedrich Leisch's introduction (PDF 20 pages)[1]
  • See also Duncan Murdoch's tools for building packages using Windows[2]
  • See also Hadley Wickham and Jennifer Bryan's online book on current packaging practices (R Packages) [3]

References

  1. Friedrich Leisch Creating R Packages : A Tutorial http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf
  2. http://www.r-project.org/conferences/useR-2008/slides/Murdoch.pdf
  3. Hadley Wickham and Jennifer Bryan R Packages : Organize, Test, Document and Share your code https://r-pkgs.org/


Data types

Data types

Vectors are the simplest R objects: an ordered collection of elements of a single basic type (e.g. real numbers, strings, logicals). Vectors are indexed by integers starting at 1. Factors are similar to vectors, but each element is categorical, i.e. one of a fixed number of possibilities (or levels). A matrix is like a vector with an additional instruction for the layout: the elements are indexed by two integers, each starting at 1. Arrays are similar to matrices but can have more than 2 dimensions. A list is similar to a vector, but its elements need not all be of the same type. The elements of a list can be indexed either by integers or by names, i.e. an R list can be used to implement what is known in other languages as an "associative array", "hash table", "map" or "dictionary". A dataframe is like a matrix but does not require all columns to have the same type: it is a list of variables/vectors of the same length. Classes define what objects of a certain type look like and are attached to an object as an attribute. Every R object has a class and a type; matrices, arrays and dataframes also have a dimension, while dim() returns NULL for plain vectors and lists.

> class(object)
> typeof(object)
> dim(object)
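
As a quick illustration, the three functions can be applied to a few small objects (all values here are made up):

```r
x <- c(1.5, 2.5)
class(x)     # "numeric"
typeof(x)    # "double"
dim(x)       # NULL: a plain vector has no dimension attribute

m <- matrix(1:6, nrow = 2)
typeof(m)    # "integer"
dim(m)       # 2 3

# A list indexed by name or by position, like a small dictionary
l <- list(a = 1, b = "two")
l[["b"]]
l[[2]]
```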

Vectors

You can create a vector using the c() function which concatenates some elements. You can create a sequence using the : symbol or the seq() function. For instance 1:5 gives all the integers between 1 and 5. The seq() function lets you specify the interval between the successive numbers. You can also repeat a pattern using the rep() function. Finally, you can create a vector of a given length filled with default values: numeric() gives a numeric vector of zeros, character() a character vector of empty strings and logical() a logical vector of FALSE values.

> c(1,2,3,4,5)
[1] 1 2 3 4 5
> c("a","b","c","d","e")
[1] "a" "b" "c" "d" "e"
> c(T,F,T,F)
[1]  TRUE FALSE  TRUE FALSE

> 1:5
[1] 1 2 3 4 5
> 5:1
[1] 5 4 3 2 1
> seq(1,5)
[1] 1 2 3 4 5
> seq(1,5,by=.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> rep(1,5)
[1] 1 1 1 1 1
> rep(1:2,5)
 [1] 1 2 1 2 1 2 1 2 1 2
> numeric(5)
[1] 0 0 0 0 0
> logical(5)
[1] FALSE FALSE FALSE FALSE FALSE
> character(5)
[1] "" "" "" "" ""

length() returns the length of a vector. last() (sfsmisc) returns the last element of a vector, but this can also be achieved without an extra package.

> x <- seq(1,5,by=.5)    # Create a sequence of numbers
> x                      # Display this object
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> length(x)              # Get the length of object x
[1] 9
> library(sfsmisc)
> last(x)                # Select the last element of x
[1] 5
> x[length(x)]           # Select the last element without an extra package
[1] 5
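
Base R also provides tail() and head(), which are a convenient way to work with the ends of a vector. A minimal sketch:

```r
x <- seq(1, 5, by = 0.5)

last_value <- tail(x, 1)     # the last element
all_but_last <- head(x, -1)  # everything except the last element
```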

Factors

factor() transforms a vector into a factor. A factor can also be ordered with the option ordered=T or the function ordered(). levels() returns the levels of a factor. gl() generates factors: n is the number of levels, k the number of replications of each level and length the total length of the result. labels is optional and gives labels to each level.

Factors can be most easily thought of as categorical variables. An important function for summarizing factors is table(), which counts the observations in each level. When considering the types of statistical data (nominal, ordinal, interval and ratio), factors can be nominal, ordinal or interval. Nominal factors are categorical names, for example country names paired with some other information. An example of an ordinal factor would be a set of race times for a particular athlete paired with the athlete's finishing place (first, second, ...). To summarize such a factor in a meaningful order, specify the levels yourself, as in the ordered() example below. Finally, an example of an interval-level factor would be age brackets such as "20 - 29", "30 - 39", etc. In general, R can automatically order numbers stored as factors appropriately, but a programmer may use the same techniques with this type of data to order the levels in the manner most appropriate to their application.

See also is.factor(), as.factor(), is.ordered() and as.ordered().

 
> factor(c("yes","no","yes","maybe","maybe","no","maybe","no","no"))
[1] yes   no    yes   maybe maybe no    maybe no    no   
Levels: maybe no yes
> 
> factor(c("yes","no","yes","maybe","maybe","no","maybe","no","no"), ordered = T)
[1] yes   no    yes   maybe maybe no    maybe no    no   
Levels: maybe < no < yes
> 
> ordered(c("yes","no","yes","maybe","maybe","no","maybe","no","no"))
[1] yes   no    yes   maybe maybe no    maybe no    no   
Levels: maybe < no < yes
>
> ordered(as.factor(c("First","Third","Second","Fifth","First","First","Third")),
+ levels = c("First","Second","Third","Fourth","Fifth"))
[1] First  Third  Second Fifth  First  First  Third 
Levels: First < Second < Third < Fourth < Fifth
>
>  gl(n=2, k=2, length=10, labels = c("Male", "Female")) # generate factor levels
 [1] Male   Male   Female Female Male   Male   Female Female Male   Male  
Levels: Male Female
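
As a small sketch of summarizing an ordered factor, table() counts the observations in each level (in level order), and ordered factors also support comparisons. The data are made up.

```r
places <- factor(c("First", "Third", "Second", "First"),
                 levels = c("First", "Second", "Third"),
                 ordered = TRUE)

counts <- table(places)          # counts follow the order of the levels
counts

before_third <- places < "Third" # comparison uses the level ordering
before_third
```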

Matrix

  • If you want to create a new matrix, one way is to use the matrix() function. You have to enter a vector of data, the number of rows and/or columns, and finally you can specify whether R should fill the matrix by row or by column (the default). Here are two examples; in the second, the 15 values are recycled to fill all 25 cells.
> matrix(data = NA, nrow = 5, ncol = 5, byrow = T)
     [,1] [,2] [,3] [,4] [,5]
[1,]   NA   NA   NA   NA   NA
[2,]   NA   NA   NA   NA   NA
[3,]   NA   NA   NA   NA   NA
[4,]   NA   NA   NA   NA   NA
[5,]   NA   NA   NA   NA   NA
> matrix(data = 1:15, nrow = 5, ncol = 5, byrow = T)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]    1    2    3    4    5
[5,]    6    7    8    9   10
  • Functions cbind() and rbind() combine vectors into matrices in a column by column or row by row mode:
> v1 <- 1:5
> v2 <- 5:1
> v2
[1] 5 4 3 2 1
> cbind(v1,v2)
     v1 v2
[1,]  1  5
[2,]  2  4
[3,]  3  3
[4,]  4  2
[5,]  5  1

> rbind(v1,v2)
   [,1] [,2] [,3] [,4] [,5]
v1    1    2    3    4    5
v2    5    4    3    2    1
  • The dimension of a matrix can be obtained using the dim() function. Alternatively nrow() and ncol() return the number of rows and columns in a matrix:
> X <- matrix(data = 1:15, nrow = 5, ncol = 5, byrow = T)
> dim(X)
[1] 5 5
> nrow(X)
[1] 5
> ncol(X)
[1] 5
  • Function t() transposes a matrix:
> t(X)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11    1    6
[2,]    2    7   12    2    7
[3,]    3    8   13    3    8
[4,]    4    9   14    4    9
[5,]    5   10   15    5   10
  • Unlike data frames, all the elements of a matrix must be of the same type (for instance numeric or character). Mixing types coerces every element to a common type:
> a=matrix(2,2,2)
> a
     [,1] [,2]
[1,]    2    2
[2,]    2    2
> a = rbind(a,c("A","A"))
> a
     [,1] [,2]
[1,] "2"  "2" 
[2,] "2"  "2" 
[3,] "A"  "A"
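
As a minimal sketch of the points above, the following shows the effect of byrow and the coercion that occurs when types are mixed. The object names m_col, m_row and m_chr are illustrative only.

```r
# Fill a 2 x 3 matrix column by column (the default) and row by row
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

m_col[1, ]   # first row when filled by column: 1 3 5
m_row[1, ]   # first row when filled by row:    1 2 3

# Adding a character row coerces the whole matrix to character
m_chr <- rbind(m_col, c("a", "b", "c"))
typeof(m_col)   # "integer"
typeof(m_chr)   # "character"
```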

Arrays

An array is composed of n dimensions where each dimension is a vector of R objects of the same type. A one-dimensional array with a single element may be constructed as follows.

> x <- array(c(T,F),dim=c(1))
> print(x)
[1] TRUE

The array x was created with a single dimension (dim=c(1)) drawn from the vector of possible values c(T,F). A similar array, y, can be created with a single dimension and two values.

> y <- array(c(T,F),dim=c(2))
> print(y)
[1]  TRUE FALSE

A three dimensional array - 3 by 3 by 3 - may be created as follows.

> z <- array(1:27,dim=c(3,3,3))
> dim(z)
[1] 3 3 3
> print(z)
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

, , 3

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27

R arrays are accessed in a manner similar to arrays in other languages: by integer index, starting at 1 (not 0). The following code shows how to extract the third slice along the third dimension of the 3 by 3 by 3 array. The result is a 3 by 3 array.

> z[,,3]
     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27

Specifying two of the three dimensions returns a one-dimensional result (a vector).

> z[,3,3]
[1] 25 26 27

Specifying all three dimensions returns a single element of the 3 by 3 by 3 array.

> z[3,3,3]
[1] 27

More complex subsetting of an array is also possible.

> z[,c(2,3),c(2,3)]
, , 1

     [,1] [,2]
[1,]   13   16
[2,]   14   17
[3,]   15   18

, , 2

     [,1] [,2]
[1,]   22   25
[2,]   23   26
[3,]   24   27

Arrays need not be symmetric across all dimensions. The following code creates a pair of 3 by 3 arrays.

> w <- array(1:18,dim=c(3,3,2))
> print(w)
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

Objects of the vectors composing the array must be of the same type, but they need not be numbers.

> u <- array(c(T,F),dim=c(3,3,2))
> print(u)
, , 1

      [,1]  [,2]  [,3]
[1,]  TRUE FALSE  TRUE
[2,] FALSE  TRUE FALSE
[3,]  TRUE FALSE  TRUE

, , 2

      [,1]  [,2]  [,3]
[1,] FALSE  TRUE FALSE
[2,]  TRUE FALSE  TRUE
[3,] FALSE  TRUE FALSE
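
The subsetting rules above can be sketched a little further. This example assumes the same 3 by 3 by 3 array; the drop argument and the dimension names used here are additions for illustration and do not appear in the examples above.

```r
z <- array(1:27, dim = c(3, 3, 3))

# By default, dimensions of length one are dropped
v <- z[, 3, 3]
is.null(dim(v))          # TRUE: the result is a plain vector

# drop = FALSE keeps the result as a 3 x 1 x 1 array
k <- z[, 3, 3, drop = FALSE]
dim(k)                   # 3 1 1

# Dimension names (hypothetical labels) make subscripts easier to read
dimnames(z) <- list(c("a", "b", "c"), c("x", "y", "z"), c("t1", "t2", "t3"))
z["c", "z", "t3"]        # 27
```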


Lists

A list is a collection of R objects. list() creates a list. unlist() transforms a list into a vector. The objects in a list do not have to be of the same type or length.

> x <- c(1:4)
> y <- FALSE
> z <- matrix(c(1:4),nrow=2,ncol=2)
> myList <- list(x,y,z)
> myList
 [[1]]
[1] 1 2 3 4

 [[2]]
[1] FALSE

 [[3]]
     [,1] [,2]
[1,]    1    2
[2,]    3    4

Lists have very flexible methods for reference:

  • by index number:
> a <- list()
> a
list()
> a[[1]] = "A"
> a
[[1]]
[1] "A"

> a[[2]]="B"
> a
[[1]]
[1] "A"

[[2]]
[1] "B"
  • By name:
> a <- list()
> a
list()
> a$fruit = "Apple"
> a
$fruit
[1] "Apple"

> a$color = "green"
> a
$fruit
[1] "Apple"

$color
[1] "green"
  • This can also be recursive and in combination
> a = list()
> a[[1]] = "house"
> a$park = "green's park"
> a
[[1]]
[1] "house"

$park
[1] "green's park"


> a$park = "green's park"
> a[[1]]$address = "1 main st."

> a
[[1]]
[[1]][[1]]
[1] "house"

[[1]]$address
[1] "1 main st."


$park
[1] "green's park"

By indexing a list with character strings, one can also create and name list elements dynamically:

>  a <- list()
>  n <- 1:10
>  fruit = paste("number of coconuts in bin",n)
> my.number = paste("I have",10:1,"coconuts")
> for (i in 1:10)a[fruit[i]] = my.number[i]
>  a$'number of coconuts in bin 7'
[1] "I have 4 coconuts"
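
The difference between single and double brackets is worth a short sketch. The list lst and its element names are hypothetical and chosen only for illustration.

```r
lst <- list(num = 1:3, flag = FALSE, txt = "a")

sub_list <- lst["num"]     # single brackets return a list of length one
element  <- lst[["num"]]   # double brackets return the element itself

class(sub_list)   # "list"
class(element)    # "integer"

# unlist() flattens the list; the elements are coerced to a common type
flat <- unlist(lst)
flat
```

Because the list mixes integers, a logical and a string, unlist() returns a character vector here.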

Data Frames

A dataframe has been referred to as "a list of variables/vectors of the same length". In the following example, a dataframe of two vectors is created, each of five elements. The first vector, v1, is composed of a sequence of the integers 1 through 5. The second vector, v2, is composed of five logical values (T and F). The dataframe is then created from these vectors. The columns of the data frame can be accessed using integer subscripts or the column name and the $ symbol.

> v1 <- 1:5
> v2 <- c(T,T,F,F,T)
> df <- data.frame(v1,v2)
> print(df)
  v1    v2
1  1  TRUE
2  2  TRUE
3  3 FALSE
4  4 FALSE
5  5  TRUE
> df[,1]
[1] 1 2 3 4 5
> df$v2
[1]  TRUE  TRUE FALSE FALSE  TRUE

The dataframe may be created directly. In the following code, the dataframe is created - naming each vector composing the dataframe as part of the argument list.

> df <- data.frame(foo=1:5,bar=c(T,T,F,F,T))
> print(df)
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
5   5  TRUE
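
A few further ways of accessing the same dataframe can be sketched as follows. The name kept is illustrative; the logical subscripts are standard base R indexing.

```r
df <- data.frame(foo = 1:5, bar = c(TRUE, TRUE, FALSE, FALSE, TRUE))

df[["foo"]]            # a column as a vector, same as df$foo
df[df$foo > 2, "bar"]  # values of bar for the rows with foo > 2

kept <- df[df$bar, ]   # rows where bar is TRUE
nrow(kept)             # 3
```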



Working with data frames

In this section, we deal with methods to read, manage and clean-up a data frame.

In R, a dataframe is a list of vectors of the same length. They don't have to be of the same type. For instance, you can combine in one dataframe a logical, a character and a numerical vector.

Reading and saving data

If data are already in an R format (.Rda or .Rdata), you can load them in memory using load(). You can save data to the R format using save().

load("mydata.Rda")
save(list='mydata',file="mydata.Rda")
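
A round trip through save() and load() can be sketched as follows. The temporary file path and the small dataframe are assumptions made so that the example is self-contained.

```r
mydata <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Write the object to a temporary .Rda file
path <- tempfile(fileext = ".Rda")
save(mydata, file = path)

# Remove it from the workspace, then restore it under its original name
rm(mydata)
loaded <- load(path)   # load() returns the names of the restored objects
loaded                 # "mydata"
exists("mydata")       # TRUE
```

Note that load() restores objects under the names they had when saved, so it can silently overwrite an existing object with the same name.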

Example Datasets

  • Most packages include example datasets to test the functions.
  • The data() function without argument gives the list of all example datasets in all the loaded packages.
  • If you want to load them in memory, you just need to use the data function and include the name of the dataset as an argument.
  • str_data() (sfsmisc) gives the structure of all datasets in a package.
> data() # lists all the datasets in all the loaded packages
> data(package="datasets") # lists all the datasets in the "datasets" package
> data(Orange) # loads the Orange dataset in memory
> ?Orange # help for the "Orange" dataset
> library("sfsmisc") # provides str_data()
> str_data("datasets") # gives the structure of all the datasets in the datasets package
  • Some packages include lots of datasets.
    • The datasets package
    • The AER package [1] includes replication datasets for some important textbooks in econometrics.
    • The Ecdat package [2] includes replication archives for the Journal of Applied Econometrics, the Journal of Business and Economic Statistics, etc.

Building your own data frames

You can create a dataframe using vectors.

N <- 100
u <- rnorm(N)
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 1 + x1 + x2 + u
mydat <- data.frame(y,x1,x2)

R has a spreadsheet-style data editor. One can use it to enter the data into a spreadsheet.

mydat <- edit(data.frame())

You can read a table from the clipboard:

> mydat <- read.table("clipboard")

You can also read space delimited tables in your code using gsource() (Zelig). Here is an example with Yule 1899 data.[3]

mydat <- gsource(var.names = "id union pauperism out old  pop", 
variables = "
1 Kensington 27 5 104 136
2 Paddington  47 12 115 111
3 Fulham 31 21 85 174
")

You can change the column names of a data frame.

c1 <- c('A','B','C')
c2 <- c('Alpha','Bravo','Charlie')
c3 <- c('1','2','3')
mydf <- data.frame(c1,c2,c3)
colnames(mydf) <- c('ColName1','ColName2','ColName3')

Describing a data frame

There are various ways to inspect a data frame, such as:

  • str(df) gives a very brief description of the data
  • names(df) gives the name of each variable
  • summary(df) gives some very basic summary statistics for each variable
  • head(df) shows the first few rows
  • tail(df) shows the last few rows.

Browsing data

  • You can browse your data in a spreadsheet using View(). Depending on your operating system, this option is not always available and the result is not always the same.
  • You can print the first lines using head() and the last lines using tail().
View(mydata)
head(mydata, n = 20) # n = 20 means  that the first 20 lines are printed in the R console
  • RStudio has a nice data browser (View(mydata)).
  • RKWard also has a nice data browser.
  • Paul Murrell is currently developing the rdataviewer package (pdf).

Binding row or column

When working with data frames, you often need to add a column or a row, which increases the dimension of the data frame. There are several ways to do it, but the easiest are cbind() and rbind(), which are part of the base package:

mydata <- cbind(mydata, newVector)
mydata <- rbind(mydata, newVector)

Remember that the length of the newVector should match the length of the side of the data frame that you are attaching it to. For example, in the cbind() command the following statement should be TRUE:

dim(mydata)[1]==length(newVector)

To see more examples, you can always do ?base::cbind and ?base::rbind.
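
A small worked sketch of both operations follows. The data and the names newCol, newRow, wider and longer are hypothetical; a one-row data frame is used for rbind() so that the column types are preserved.

```r
mydata <- data.frame(id = 1:3, x = c(2.5, 3.1, 4.7))

# Add a column: the vector needs one value per row
newCol <- c("a", "b", "c")
stopifnot(nrow(mydata) == length(newCol))
wider <- cbind(mydata, group = newCol)

# Add a row: a one-row data frame with the same column names
newRow <- data.frame(id = 4L, x = 5.2, group = "d")
longer <- rbind(wider, newRow)

dim(wider)    # 3 3
dim(longer)   # 4 3
```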

Attaching data

One of the big advantages of R over Stata is that you can deal with multiple datasets at the same time. You just need to specify the name of the dataset and a "$" symbol before each variable name ( for instance mydat1$var1 and mydat2$var1). If you only work with one dataset and you don't want to write again and again the name of the dataset as a prefix for each variable, you can use attach().

mydata$var1
attach(mydata)
var1
detach(mydata)
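
As an alternative sketch, the base function with() (not discussed above) evaluates a single expression inside a data frame without changing the search path. The data frame and the names m1 and m2 are illustrative.

```r
mydata <- data.frame(var1 = c(1, 2, 3), var2 = c(10, 20, 30))

# Prefix each variable with the data frame name
m1 <- mean(mydata$var1 + mydata$var2)

# Evaluate the expression inside the data frame, without attach()
m2 <- with(mydata, mean(var1 + var2))

c(m1, m2)   # both are 22
```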

Detecting duplicates

When you want to clean up a data set, it is very often useful to check whether the same information appears twice in the data. R provides some functions to detect duplicates.

  • duplicated() looks at duplicated elements and returns a logical vector. You can use table() to summarize this vector.
  • Duplicated() (sfsmisc) generalizes this command. Duplicated() only marks unique values with "NA".
  • remove.dup.rows() (cwhmisc).
  • unique() keeps only the unique lines in a dataset.
  • distinct() (dplyr) retains only unique/distinct rows from a dataset.


library("Zelig")
mydat <- gsource(
variables = "
1 1 1 1
1 1 1 1
1 2 3 4
1 2 3 4
1 2 2 2
1 2 3 2")
unique(mydat) # keep unique rows
library(cwhmisc)
remove.dup.rows(mydat) # similar to unique()
table(duplicated(mydat)) # table duplicated lines
mydat$dups <- duplicated(mydat) # add a logical variable for duplicates
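
The same checks can be sketched with base R only, which avoids the dependency on Zelig and cwhmisc. The column names a to d are assumptions; the values mirror the table above.

```r
mydat <- data.frame(a = c(1, 1, 1, 1, 1, 1),
                    b = c(1, 1, 2, 2, 2, 2),
                    c = c(1, 1, 3, 3, 2, 3),
                    d = c(1, 1, 4, 4, 2, 2))

dup <- duplicated(mydat)   # TRUE for rows that repeat an earlier row
dup                        # FALSE TRUE FALSE TRUE FALSE FALSE
table(dup)

clean <- unique(mydat)     # keeps the first occurrence of each row
nrow(clean)                # 4
```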

Creating and removing variables

To create a new variable, assign a vector to a new column name:

mydata$newvar <- mydata$oldvar

If you want to delete a variable in a dataset, you can assign NULL to that variable:

# Delete the x variable in the df data frame.
df$x <- NULL
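
Both steps can be sketched together on a small, hypothetical data frame:

```r
mydata <- data.frame(x = 1:4, y = c(2, 4, 6, 8))

# Create a new variable from existing columns
mydata$ratio <- mydata$y / mydata$x
names(mydata)        # "x" "y" "ratio"

# Delete a variable by assigning NULL
mydata$y <- NULL
names(mydata)        # "x" "ratio"
```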

Renaming variables

  • It is possible to rename a variable by redefining the vector of names of a data frame.
  • There is also a rename() function in the reshape package.
df <- data.frame(x = 1:10, y = 21:30)
names(df)
names(df) <- c("toto","tata")
names(df)
names(df)[2] <- "titi"
names(df)

Creating a subset of the data

One can subset the data using subset(). The first argument is the name of the dataset, the second argument is a logical condition which says which lines will be included in the new dataset, and the last argument is the list of variables which will be included in the new dataset.

In the following example, we generate a fake dataset and we use the subset() command to select the lines and columns of interest. We choose the lines such that x1 > 0 and x2 < 0 and we only keep x1 and x2 as variables.

N <- 100
x1 <- rnorm(N)
x2 <- 1 + rnorm(N) + x1
x3 <- rnorm(N) + x2
mydat <- data.frame(x1,x2,x3)
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x1,x2))
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = - x3) # the same.

It is also possible to reorder the columns using the select option.

subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x1,x2))
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x2,x1))

Sorting and ordering

  • order() returns the permutation of row indices that sorts the data by one or more variables:
mydat[order(mydat$var1, mydat$var2),]

Suppose you want to randomize the order in a data set. You just need to generate a vector from a uniform distribution and to sort following that vector.

df[order(runif(nrow(df))),]
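
A short sketch of both uses follows. The data are hypothetical, and sample() is shown as an assumed alternative to order(runif()) for shuffling rows.

```r
df <- data.frame(name = c("b", "a", "c", "a"), score = c(2, 5, 1, 3))

# Sort by name, then by score in decreasing order within each name
sorted <- df[order(df$name, -df$score), ]
sorted$score            # 5 3 2 1

# Shuffle the rows; sample() returns a random permutation of 1..nrow(df)
set.seed(1)
shuffled <- df[sample(nrow(df)), ]
nrow(shuffled)          # 4
```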

Detecting missing values

  • is.na() returns a logical object of the same shape as its argument, equal to TRUE where a value is missing and to FALSE otherwise.
  • complete.cases() returns a logical vector with one element per row, equal to TRUE if the row has no missing values and to FALSE otherwise.
> table(complete.cases(df))

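The two functions can be compared on a small, hypothetical data frame with one missing value in each column:

```r
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

is.na(df$x)            # FALSE TRUE FALSE: one value per element
colSums(is.na(df))     # number of missing values in each column

ok <- complete.cases(df)
ok                     # TRUE FALSE FALSE: one value per row
df[ok, ]               # keep only the complete rows
```
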
Reshaping a dataframe

This topic is important if you deal with panel data. Panel data can be stored in a wide format with one observation per unit and a variable for each time period or in a long format with one observation per unit and time period. reshape() reshapes a dataset in a wide or long format.

> country <- c("Angola","UK","France")
> gdp.1960 <- c(1,2,3)
> gdp.1970 <- c(2,4,6)
> mydat <- data.frame(country,gdp.1960,gdp.1970)
> mydat # wide format
  country gdp.1960 gdp.1970
1  Angola       1       2
2      UK       2       4
3  France       3       6
> reshape( data = mydat, varying = list(2:3) , v.names = "gdp", direction = "long") # long format
    country time gdp id
1.1  Angola    1   1  1
2.1      UK    1   2  2
3.1  France    1   3  3
1.2  Angola    2   2  1
2.2      UK    2   4  2
3.2  France    2   6  3
  • varying gives the numbers of the columns which are time-varying
  • v.names gives the prefix of the time-varying variables
  • direction gives the direction, either "long" or "wide".
  • See also :
    • reShape() (Hmisc)
    • See Hadley Wickham's reshape package[4]
    • See Duncan Murdoch's tables package [5]
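
As a sketch of the reverse operation, the following converts the same data to long format and back to wide. The idvar, timevar and times arguments are set explicitly here, which is an assumption that differs slightly from the call above (it uses the country as identifier and the actual years as time values).

```r
mydat <- data.frame(country = c("Angola", "UK", "France"),
                    gdp.1960 = c(1, 2, 3),
                    gdp.1970 = c(2, 4, 6))

# Wide to long: one row per country and year
long <- reshape(data = mydat, varying = list(2:3), v.names = "gdp",
                timevar = "time", idvar = "country",
                times = c(1960, 1970), direction = "long")

# Long back to wide: one gdp column per value of time
wide <- reshape(data = long, v.names = "gdp", timevar = "time",
                idvar = "country", direction = "wide")
names(wide)
```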

Expanding a dataset

Sometimes we need to duplicate some lines in a dataset, for instance to generate a fake dataset with a panel data structure. In that case, we first generate time-invariant variables and then repeat each line a given number of times in order to create time-varying variables.

It is possible to use the expand() function in the epicalc package (this package is no longer available; an alternative way to expand is given in [1]). This function repeats each line a given number of times.

N <- 1000
T <- 5
wide <- data.frame(id = 1:N,f = rnorm(N),  rep = T)
library("epicalc")
long <- expand(wide,index.var = "rep")
long$time <- rep(1:T,N)

We can also use a do-it-yourself solution and create our own function. The idea is simple. We create a vector which gives, for each line, the number of times it should be replicated (dups in the following example). Then we use the rep() function to create a vector which repeats the line numbers according to what we want. The last step creates a new dataset which repeats lines according to the desired pattern.

expand <- function(df,dups){
	df$dups <- dups
	pattern <- rep(1:nrow(df), times=df$dups)
	df2 <- df[pattern,]
	index <- function(x){
		1:length(x)
		}
	df2$year <- unlist(tapply(df2$dups, df2$id, index))
	df2$dups <- NULL 
	return(df2)
	}

df <- data.frame(x = rnorm(3), id = 1:3)
dups = c(3,1,2)
expand(df,dups)

Merging dataframes

Merging data can be very confusing, especially in the case of multiple merges. Here is a simple example:

We have one table describing authors :

> authors <- data.frame(
+     surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+     nationality = c("US", "Australia", "US", "UK", "Australia"),
+     deceased = c("yes", rep("no", 4)))
> authors
   surname nationality deceased
1    Tukey          US      yes
2 Venables   Australia       no
3  Tierney          US       no
4   Ripley          UK       no
5   McNeil   Australia       no

and one table describing books

> books <- data.frame(
+     name = I(c("Tukey", "Venables", "Tierney",
+              "Ripley", "Ripley", "McNeil", "R Core")),
+     title = c("Exploratory Data Analysis",
+               "Modern Applied Statistics ...",
+               "LISP-STAT",
+               "Spatial Statistics", "Stochastic Simulation",
+               "Interactive Data Analysis",
+               "An Introduction to R"),
+     other.author = c(NA, "Ripley", NA, NA, NA, NA,
+                      "Venables & Smith"))
> books
      name                         title     other.author
1    Tukey     Exploratory Data Analysis             <NA>
2 Venables Modern Applied Statistics ...           Ripley
3  Tierney                     LISP-STAT             <NA>
4   Ripley            Spatial Statistics             <NA>
5   Ripley         Stochastic Simulation             <NA>
6   McNeil     Interactive Data Analysis             <NA>
7   R Core          An Introduction to R Venables & Smith

We want to merge tables books and authors by author's name ("surname" in the first dataset and "name" in the second one). We use the merge() command. We specify the name of the first and the second datasets, then by.x and by.y specify the identifier in both datasets. all.x and all.y specify whether we want to keep all the observations of the first and the second dataset. In this case we want to have all the observations from the books dataset, but we only keep the observations from the authors dataset which match an observation in the books dataset.

> final <- merge(books, authors, by.x = "name", by.y = "surname", sort=F,all.x=T,all.y=F)
> final
      name                         title     other.author nationality deceased
1    Tukey     Exploratory Data Analysis             <NA>          US      yes
2 Venables Modern Applied Statistics ...           Ripley   Australia       no
3  Tierney                     LISP-STAT             <NA>          US       no
4   Ripley            Spatial Statistics             <NA>          UK       no
5   Ripley         Stochastic Simulation             <NA>          UK       no
6   McNeil     Interactive Data Analysis             <NA>   Australia       no
7   R Core          An Introduction to R Venables & Smith        <NA>     <NA>

It is also possible to merge two data.frame objects while preserving the rows’ order by one of the two merged objects.[6]
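
The effect of all.x, all.y and all can be sketched on reduced, hypothetical versions of the two tables. The names inner, left and full are illustrative labels borrowed from database terminology.

```r
authors <- data.frame(surname = c("Tukey", "Ripley", "McNeil"),
                      nationality = c("US", "UK", "Australia"))
books <- data.frame(name = c("Tukey", "Ripley", "Ripley", "R Core"),
                    title = c("Exploratory Data Analysis", "Spatial Statistics",
                              "Stochastic Simulation", "An Introduction to R"))

# Only the rows that match in both tables
inner <- merge(books, authors, by.x = "name", by.y = "surname")
# All books, with NA where no author matches
left  <- merge(books, authors, by.x = "name", by.y = "surname", all.x = TRUE)
# All rows of both tables
full  <- merge(books, authors, by.x = "name", by.y = "surname", all = TRUE)

c(nrow(inner), nrow(left), nrow(full))   # 3 4 5
```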

Resources

References

  1. The AER Package http://cran.r-project.org/web/packages/AER/index.html
  2. The Ecdat Package http://cran.r-project.org/web/packages/Ecdat/index.html
  3. "An investigation into the causes of changes in pauperism in England, chiefly during the last two intercensal decades (Part I.)" - GU Yule - Journal of the Royal Statistical Society, June 1899, p 283
  4. Reshaping Data with the reshape Package : http://www.jstatsoft.org/v21/i12
  5. vignette for the tables package: http://cran.r-project.org/web/packages/tables/vignettes/tables.pdf
  6. Merging data frames while preserving the rows
  7. R Data Manual http://cran.r-project.org/doc/manuals/R-data.html
  8. Paul Murrell introduction to Data Technologies http://www.stat.auckland.ac.nz/~paul/ItDT/




Importing and exporting data

Data can be stored in a large variety of formats. Each statistical package has its own format for data (xls for Microsoft Excel, dta for Stata, sas7bdat for SAS, ...). R can read almost all file formats. We present a method for each kind of file. If none of the following methods works, you can use dedicated data conversion software such as the free software OpenRefine or the commercial software Stat Transfer.[1] In any case, most statistical software can export data to CSV (comma-separated values) format and all of them can read CSV data. This is often the best solution to make data available to everyone.

Graphical user interfaces

Some IDEs and GUIs provide point-and-click tools to import data.

You may also have a look at speedR, a graphical user interface which helps to import data from Excel, OpenOfficeCalc, CSV and other text files.[2]

library(speedR)
speedR()

CSV (csv,txt,dat)

You can import data from a text file (often CSV) using read.table(), read.csv() or read.csv2(). The option header = TRUE indicates that the first line of the CSV file should be interpreted as variables names and the option sep = gives the separator (generally "," or ";").

csv.get() (Hmisc) is another possibility.

mydata <- read.table("data.txt",header=TRUE)
mydata <- read.table("data.csv", header = TRUE, sep=",")  # import from a CSV
mydata <- read.csv("data.csv", header=T)
mydata <- read.table("data.csv", header = TRUE, sep=";") 
mydata <- read.csv2("data.csv", header=T)

Note that there is no problem if your data are stored on the internet.

df <- read.table("http://www.mywebsite.com/.../data.csv", header = TRUE, sep = ",")

Before R 4.0.0, strings were converted to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE. If you use an older version and want to avoid this conversion, you can specify the option stringsAsFactors = FALSE.

You can export data to a text file using write.table().

write.table(mydat,file="mydat.csv",quote=T,append=F,sep=",",eol = "\n", na = "NA", dec = ".", row.names = T,col.names = T)
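
A round trip through a CSV file can be sketched as follows. The temporary path and the small data frame are assumptions made so that the example is self-contained; row names are left out so that the file can be read back unchanged.

```r
mydat <- data.frame(id = 1:3, name = c("a", "b", "c"),
                    stringsAsFactors = FALSE)

# Export to a temporary CSV file without row names
path <- tempfile(fileext = ".csv")
write.table(mydat, file = path, sep = ",", row.names = FALSE, col.names = TRUE)

# Import it again and compare with the original
back <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
str(back)
all.equal(mydat, back)   # TRUE
```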

For large CSV files, it is possible to use the ff package.[3]

library("ff")
df <- read.csv.ffdf(file="large_csv_file.csv", header=TRUE, VERBOSE=TRUE, first.rows=10000, next.rows=50000)

Fixed width text files

read.fwf() reads fixed width text files and write.fwf() (gdata) writes them.

Some fixed width text files are provided with a SAS script to import them. Anthony Damico has created the SAScii package to easily import those data.[4]
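
A minimal sketch of read.fwf() follows. The file content, the column widths and the column names are hypothetical.

```r
# Write a small fixed width file: 3 characters for id, 4 for name, 2 for age
path <- tempfile(fileext = ".txt")
writeLines(c("001Ann 34",
             "002Bob 27"), path)

fw <- read.fwf(path, widths = c(3, 4, 2),
               col.names = c("id", "name", "age"),
               stringsAsFactors = FALSE, strip.white = TRUE)
fw$age    # 34 27
```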

Unstructured text files

Stata (dta)

  • We can read Stata data using read.dta() in the foreign package and export to Stata data format using write.dta().
  • Note that string variables in Stata are limited to 244 characters. This can be an issue when exporting.
  • See also Stata.file() in the memisc package and stata.get() in the Hmisc package.
> library("foreign")
> mydata <- read.dta("mydata.dta",convert.dates = TRUE, convert.factors = TRUE, convert.underscore = TRUE)
> names(mydata)
> write.dta(mydata, file = "mydata.dta")

SAS (sas7bdat)

Experimental support for SAS databases having the sas7bdat extension is provided by the sas7bdat[5] package. However, sas7bdat files generated by 64 bit versions of SAS, and SAS running on non-Microsoft Windows platforms are not yet supported.

SAS (xpt)

  • See also sasexport.get() and sas.get() in the Hmisc package.
  • See also the SASxport package.
library("foreign")
mydata<-read.xport("SASData.xpt")
names(mydata)

SPSS (sav)

  • read.spss() (foreign) and spss.get() (Hmisc)
> library("foreign")
> mydata<-read.spss("SPSSData.sav")
> names(mydata)

EViews

readEViews() in the hexView package for EViews files.

Excel (xls,xlsx)

Importing data from Excel is not easy. The solution depends on your operating system. If none of the methods below works, you can always export each Excel spreadsheet to CSV format and read the CSV in R. This is often the simplest and quickest solution.

XLConnect supports reading and writing both xls and xlsx file formats. Since it is based on Apache POI it only requires a Java installation and as such works on many platforms including Windows, UNIX/Linux and Mac. Besides reading & writing data it provides a number of additional features such as adding plots, cell styling & style actions and many more.

require("XLConnect")
wb <- loadWorkbook("myfile.xls", create = FALSE)
# Show a summary of the workbook (shows worksheets,
# defined names, hidden sheets, active sheet name, ...)
summary(wb)
# Read data from a worksheet interpreting the first row as column names
df1 <- readWorksheet(wb, sheet = "mysheet")
# Read data from a named region/range interpreting the first row as column
# names
df2 <- readNamedRegion(wb, name = "myname", header = TRUE)

The RODBC solution:

library("RODBC")
# 32-bit Windows:
channel <- odbcConnectExcel("Graphiques pourcent croissance.xls") # creates a connection
# 64-bit Windows:
# channel <- odbcConnectExcel2007("Graphiques pourcent croissance.xls")
sqlTables(channel) # List all the tables
effec <- sqlFetch(channel, "effec") # Read one spreadsheet as an R table
odbcClose(channel) # close the connection (don't forget)

The xlsReadWrite package (this package is no longer on CRAN, but you can download old versions from the CRAN archive).

library(xlsReadWrite)
mydat <- read.xls("myfile.xls", colNames = T, sheet = "mysheet", type = "data.frame", from = 1, checkNames = TRUE)
  • "sheet" specifies the name or the number of the sheet you want to import.
  • "from" specifies the first row of the spreadsheet.

The gnumeric package.[6] This package uses an external program called ssconvert which is usually installed with gnumeric, the Gnome office spreadsheet. The read.gnumeric.sheet() function reads xls and xlsx files.

library("gnumeric")
df1 <- read.gnumeric.sheet(file = "df.xls", head = TRUE, sheet.name = "Feuille1")
df2 <- read.gnumeric.sheet(file = "df.xlsx", head = TRUE, sheet.name = "Feuille1")

See also xlsx for Excel 2007 documents and read.xls() (gdata).

Google Spread Sheets

You should make the spreadsheet public and publish it as a CSV file. Then you can read it in R using read.csv(). See more on the Revolution's computing blog. See also RGoogleDocs.

# Read from a Google SpreadSheet.
require(RCurl)
myCsv <- getURL("https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AkuuKBh0jM2TdGppUFFxcEdoUklCQlJhM2kweGpoUUE&single=true&gid=0&output=csv")
read.csv(textConnection(myCsv))

gnumeric spreadsheets

The gnumeric package[6]. read.gnumeric.sheet() reads one sheet and read.gnumeric.sheets() reads all sheets and stores them in a list.

library("gnumeric")
df <- read.gnumeric.sheet(file = "df.gnumeric", head = TRUE, sheet.name = "df.csv")
View(df)
df <- read.gnumeric.sheets(file = "df.gnumeric", head = TRUE)
View(df$df.csv)

OpenOffice and LibreOffice (ods)

readODS does not require external dependencies, making it cross-platform.

library("readODS")
df=read.ods("df.ods")

speedR is another alternative.

library("speedR")
df <- speedR.importany(file = "df.ods")

Note that you can also use the speedR graphical user interface (speedR()) which will return the command line for replication.

library("speedR")
speedR()

JSON

JSON (JavaScript Object Notation) is a very common format on the internet. The rjson package makes it easy to import data from a JSON format[7].

# json.txt : a text file including data in the JSON format
library("rjson")
df <- fromJSON(paste(readLines("json.txt"), collapse=""))

It is easy to export a list or a dataframe to a JSON format using the toJSON() function:

# df : a data frame
library("rjson")
json <- toJSON(df)

Sometimes the JSON data can be more complex with structures such as nested arrays. In this case you may find it more useful to use an online converter like json-csv.com to convert the file to CSV. Then import the resulting data as per the CSV instructions above.

dBase (dbf)

read.dbf() in the foreign package.

library("foreign")
df  <- read.dbf("file.dbf")
str(df)

Hierarchical Data Format (hdf5)

hdf5 data can be read using the hdf5 package[8].

SQL

DICOM and NIfTI

  • See "Working with the {DICOM} and {NIfTI} Data Standards in R" in the Journal of Statistical Software[9]

Resources

  • R Data Manual[10].
  • Paul Murrell's Introduction to Data Technologies[11].

References

  1. Stat Transfer
  2. speedR
  3. "Opening Large CSV Files in R". Retrieved March 7, 2013.
  4. David Smith. "Importing public data with SAS instructions into R". Revolution Analytics. Retrieved February 1, 2013.
  5. sas7bdat
  6. This command has been tested using Ubuntu 10.10 and R 2.11.1
  7. http://cran.r-project.org/web/packages/rjson/index.html
  8. http://cran.r-project.org/web/packages/hdf5/index.html
  9. Brandon Whitcher, Volker J. Schmid, Andrew Thorton "Working with the {DICOM} and {NIfTI} Data Standards in R", Journal of Statistical Software Vol. 44, Issue 6, Oct 2011, link
  10. R Data Manual
  11. Paul Murrell introduction to Data Technologies


Text Processing

This page includes all the material you need to deal with strings in R. The section on regular expressions may be useful to understand the rest of the page, even if it is not necessary if you only need to perform some simple tasks.

This page may be useful to :

  • perform statistical text analysis.
  • collect data from an unformatted text file.
  • deal with character variables.

In this page, we learn how to read a text file and how to use R functions for characters. There are two kinds of functions for characters: simple functions and regular expressions. Many functions are part of the standard R base package.

help.search(keyword = "character", package = "base")

However, their names and syntax are not intuitive to all users. Hadley Wickham has developed the stringr package, which defines functions with similar behaviour but with names that are easier to remember and a much more systematic syntax[1].

  • Keywords : text mining, natural language processing
  • See CRAN Task view on Natural Language Processing[2]
  • See also the following packages tm, tau, languageR, scrapeR.


Reading and writing text files

R can read any text file using readLines() or scan(). It is possible to specify the encoding of the imported text file with readLines(). The entire contents of the text file can be read into an R object (e.g., a character vector). scan() is more flexible. The kind of data expected can be specified in the second argument (e.g., character(0) for a string).

text <- readLines("file.txt",encoding="UTF-8")
scan("file.txt", character(0)) # separate each word
scan("file.txt", character(0), quote = NULL) # get rid of quotes
scan("file.txt", character(0), sep = ".") # separate each sentence
scan("file.txt", character(0), sep = "\n") # separate each line

We can write the content of an R object into a text file using cat() or writeLines(). By default cat() concatenates vectors when writing to the text file. You can change it by adding options sep="\n" or fill=TRUE. The default encoding depends on your computer.

cat(text,file="file.txt",sep="\n")
writeLines(text, con = "file.txt", sep = "\n", useBytes = FALSE)
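
A round trip can be sketched as follows. The temporary file and the two lines of text are assumptions made so that the example is self-contained.

```r
path <- tempfile(fileext = ".txt")
text <- c("First line.", "Second line has five words.")

# Write one element per line, then read the lines back
writeLines(text, con = path)
back <- readLines(path)
identical(text, back)    # TRUE

# scan() splits on whitespace, returning one element per word
words <- scan(path, what = character(0), quiet = TRUE)
length(words)            # 7
```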

Before reading a text file, you can look at its properties. nlines() (parser package) and countLines() (R.utils package) count the number of lines in the file. count.chars() (parser package) counts the number of bytes and characters in each line of a file. You can also display a text file using file.show().

Character encoding

R provides functions to deal with various encoding schemes. This is useful if you deal with text files which have been created on another operating system, especially if the language is not English and has many accents and specific characters. For instance, the standard encoding scheme in Linux is "UTF-8" whereas the standard encoding scheme in Windows is "Latin1". The Encoding() function returns the encoding of a string. iconv() is similar to the Unix command iconv and converts the encoding.

  • iconvlist() gives the list of available encoding schemes on your computer.
  • readLines(), scan() and file.show() also have an encoding option.
  • is.utf8() (tau) tests if the encoding is "utf8".
  • is.locale() (tau) tests if encoding is the same as the default encoding on your computer.
  • translate() (tau) translates the encoding into the current locale.
  • fromUTF8() (descr) is less general than iconv().
  • utf8ToInt() (base)

Example

The following example was run under Windows. Thus, the default encoding is "latin1".

> texte <- "Hé hé"
> Encoding(texte)
[1] "latin1"
> texte2 <-  iconv(texte,"latin1","UTF-8")
> Encoding(texte2)
[1] "UTF-8"
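
The same conversion can be sketched in a way that does not depend on the platform default. This example assumes the string is first marked as UTF-8 with enc2utf8() and then converted to Latin-1; the byte counts follow from the two encodings:

```r
x <- enc2utf8("H\u00e9 h\u00e9")    # "Hé hé", explicitly marked as UTF-8
Encoding(x)                          # "UTF-8"
nchar(x, type = "chars")             # 5 characters
nchar(x, type = "bytes")             # 7 bytes: each "é" takes two bytes in UTF-8

y <- iconv(x, from = "UTF-8", to = "latin1")
Encoding(y)                          # "latin1"
nchar(y, type = "bytes")             # 5 bytes: one byte per character in Latin-1

utf8ToInt("\u00e9")                  # 233, the Unicode code point of "é"
```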

Regular Expressions

A regular expression describes a pattern shared by a set of strings. For instance, one could have the following pattern: 2 digits, 2 letters and 4 digits. R provides powerful functions to deal with regular expressions. Two types of regular expressions are used in R[3]:

  • extended regular expressions, used by ‘perl = FALSE’ (the default),
  • Perl-like regular expressions used by ‘perl = TRUE’.

There is also an option called ‘fixed = TRUE’, which treats the pattern as a literal string rather than a regular expression. fixed() (stringr) is equivalent to fixed=TRUE in the standard regex functions. These functions are case sensitive by default. This can be changed by specifying the option ignore.case = TRUE.

If you are not a specialist in regular expressions, you may find glob2rx() useful. This function converts a wildcard ("glob") pattern into the corresponding regular expression:

> glob2rx("abc.*")
[1] "^abc\\."

Functions which use regular expressions in R

  • sub(), gsub(), str_replace() (stringr) make substitutions in a string.
  • grep(), str_extract() (stringr) extract values.
  • grepl(), str_detect() (stringr) detect the presence of a pattern.
  • See also splitByPattern() (R.utils).
  • See also gsubfn() in the gsubfn package.

Extended regular expressions (the default)

  • "." stands for any character.
  • "[ABC]" means A, B or C.
  • "[A-Z]" means any upper-case letter between A and Z.
  • "[0-9]" means any digit between 0 and 9.

The metacharacters are ‘$ * + . ? [ ] ^ { } | ( ) \’. To match one of them literally, precede it with a doubled backslash (the backslash must itself be escaped inside an R string).

Character classes match a whole family of characters. They are only valid inside a bracket expression, so the class ‘[:digit:]’ is written "[[:digit:]]" in a pattern.

For numbers:

  • ‘[:digit:]’ Digits: ‘0 1 2 3 4 5 6 7 8 9’.

For letters:

  • ‘[:alpha:]’ Alphabetic characters: ‘[:lower:]’ and ‘[:upper:]’.
  • ‘[:upper:]’ Upper-case letters.
  • ‘[:lower:]’ Lower-case letters.

Note that the set of alphabetic characters includes accented letters such as é è ê, which are very common in some languages like French. Therefore, it is more general than "[A-Za-z]", which does not include accented letters.

For other characters:

  • ‘[:punct:]’ Punctuation characters: ‘! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~’.
  • ‘[:space:]’ Space characters: tab, newline, vertical tab, form feed, carriage return, and space.
  • ‘[:blank:]’ Blank characters: space and tab.
  • ‘[:cntrl:]’ Control characters.

For combinations of other classes:

  • ‘[:alnum:]’ Alphanumeric characters: ‘[:alpha:]’ and ‘[:digit:]’.
  • ‘[:graph:]’ Graphical characters: ‘[:alnum:]’ and ‘[:punct:]’.
  • ‘[:print:]’ Printable characters: ‘[:alnum:]’, ‘[:punct:]’ and space.
  • ‘[:xdigit:]’ Hexadecimal digits: ‘0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f’.

You can specify the number of repetitions by adding one of the following quantifiers after the item to repeat:

  • ‘?’ The preceding item is optional and will be matched at most once.
  • ‘*’ The preceding item will be matched zero or more times.
  • ‘+’ The preceding item will be matched one or more times.
  • ‘{n}’ The preceding item is matched exactly ‘n’ times.
  • ‘{n,}’ The preceding item is matched ‘n’ or more times.
  • ‘{n,m}’ The preceding item is matched at least ‘n’ times, but not more than ‘m’ times.
  • ‘^’ anchors the regular expression at the beginning of the string.
  • ‘$’ anchors the regular expression at the end of the string.

If you want to know more, have a look at the following two help files:

> ?regexp # gives some general explanations
> ?grep # help file for grep(),regexpr(),sub(),gsub(),etc
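
The classes, quantifiers and anchors described in this section can be combined as in the following sketch. The input strings are made up for illustration, and the pattern is only one possible way to write "2 letters followed by 4 digits":

```r
# Hypothetical inputs: one code that fits the pattern and two that do not.
codes <- c("AB1234", "ab1234", "ABC123")

# Two upper-case letters followed by exactly four digits, anchored at both ends.
pattern <- "^[[:upper:]]{2}[[:digit:]]{4}$"
grepl(pattern, codes)                        # TRUE FALSE FALSE

# Matching is case sensitive unless ignore.case = TRUE.
grepl("ab", "AB")                            # FALSE
grepl("ab", "AB", ignore.case = TRUE)        # TRUE

# A metacharacter must be escaped, or matched literally with fixed = TRUE.
grepl("a.c", c("a.c", "abc"))                # TRUE TRUE ("." matches any character)
grepl("a\\.c", c("a.c", "abc"))              # TRUE FALSE
grepl("a.c", c("a.c", "abc"), fixed = TRUE)  # TRUE FALSE
```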

Perl-like regular expressions

It is also possible to use "perl-like" regular expressions. You just need to use the option perl=TRUE.

Examples

If you want to remove space characters in a string, you can use the \\s Perl macro. gsub() removes every space character, whereas sub() would remove only the first one.

x <- "abc def ghk"
gsub('\\s', '', x, perl = TRUE) # returns "abcdefghk"

Concatenating strings

  • paste() concatenates strings.
  • str_c() (stringr) does a similar job.
  • cat() prints and concatenates strings.

Examples

> paste("toto","tata",sep=' ')
[1] "toto tata"
> paste("toto","tata",sep=",")
[1] "toto,tata"
> str_c("toto","tata",sep=",")
[1] "toto,tata"
> x <- c("a","b","c")
> paste(x,collapse=" ")
[1] "a b c"
> str_c(x, collapse = " ")
[1] "a b c"
> cat(c("a","b","c"), sep = "+")
a+b+c

Splitting a string

  • strsplit() : Split the elements of a character vector ‘x’ into substrings according to the matches to substring ‘split’ within them.
  • See also str_split() (stringr).
> unlist(strsplit("a.b.c", "\\."))
[1] "a" "b" "c"
  • tokenize() (tau) splits a string into tokens.
> tokenize("abc defghk")
[1] "abc"    " "      "defghk"
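
As a short sketch of the main strsplit() options in base R (the input strings are illustrative): the split argument is a regular expression unless fixed = TRUE is given, and the result is a list with one element per input string.

```r
# Splitting on a literal dot: escape it, or use fixed = TRUE.
by_dot <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]           # "a" "b" "c"

# Splitting on a regular expression: one or more whitespace characters.
by_space <- strsplit("abc   def\tghk", "[[:space:]]+")[[1]]   # "abc" "def" "ghk"

# strsplit() is vectorised and returns one list element per input string.
parts <- strsplit(c("2000-05-23", "2004-05-18"), "-")
years <- sapply(parts, `[`, 1)                                # "2000" "2004"
```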

Counting the number of characters in a string

  • nchar() gives the length of a string. Note that for non-ASCII strings, there is more than one way to measure such a length (in characters or in bytes).
  • See also str_length() (stringr)
> nchar("abcdef")
[1] 6
> nchar(NA)
[1] NA
> nchar("René")
[1] 4
> nchar("René", type = "bytes")
[1] 5

Detecting the presence of a substring

Detecting a pattern in a string

  • grepl() returns a logical value (TRUE or FALSE) for each element of its input.
  • str_detect() (stringr) does a similar job.
> string <- "23 mai 2000"
> string2 <- "1 mai 2000"
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> grepl(pattern = regexp, x = string)
[1] TRUE
> str_detect(string, regexp)
[1] TRUE
> grepl(pattern = regexp, x = string2)
[1] FALSE

The first result is TRUE and the second one is FALSE, since the day in string2 has only one digit whereas the pattern requires two.

Counting the occurrence of each pattern in a string

  • textcnt() (tau) counts the occurrence of each pattern or each term in a text.
> string <- "blabla 23 mai 2000 blabla 18 mai 2004"
> textcnt(string,n=1L,method="string")
blabla    mai 
     2      2 
attr(,"class")
[1] "textcnt"
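
Counting can also be done with base R only. This is a minimal sketch, assuming that gregexpr() with regmatches() is an acceptable substitute for textcnt() when a single pattern is of interest:

```r
string <- "blabla 23 mai 2000 blabla 18 mai 2004"

# Count the matches of one pattern: gregexpr() finds every match and
# regmatches() extracts them.
years <- regmatches(string, gregexpr("[[:digit:]]{4}", string))[[1]]
length(years)                        # 2 four-digit numbers

# Count every whitespace-separated term.
counts <- table(strsplit(string, " ")[[1]])
counts[["mai"]]                      # 2
```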

Extracting the position of a substring or a pattern in a string

Extracting the position of a substring

  • cpos() (cwhmisc) returns the position of a substring in a string.
  • substring.location() (cwhmisc) does the same job but returns the first and the last position.
 
> cpos("abcdefghijklmnopqrstuvwxyz","p",start=1)
[1] 16
> substring.location("abcdefghijklmnopqrstuvwxyz","def")
$first
[1] 4

$last
[1] 6

Extracting the position of a pattern in a string

  • regexpr() returns the position of the regular expression. str_locate() (stringr) does the same job. gregexpr() is similar to regexpr() but the starting position of every match is returned. str_locate_all() (stringr) does the same job.
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> string <- "blabla 23 mai 2000 blabla 18 mai 2004"
> regexpr(pattern = regexp, text = string)
[1] 8
attr(,"match.length")
[1] 11
> gregexpr(pattern = regexp, text = string)
[[1]]
[1]  8 27
attr(,"match.length")
[1] 11 11
> str_locate(string,regexp)
     start end
[1,]     8  18
> str_locate_all(string,regexp)
[[1]]
     start end
[1,]     8  18
[2,]    27  37
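
The start positions and match lengths returned by gregexpr() can be turned into start and end positions by hand. A small sketch using only base R:

```r
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
string <- "blabla 23 mai 2000 blabla 18 mai 2004"

pos <- gregexpr(pattern = regexp, text = string)[[1]]
starts <- as.integer(pos)                          # 8 27
ends <- starts + attr(pos, "match.length") - 1L    # 18 37

# substring() cuts the matches out by position...
by_position <- substring(string, starts, ends)
# ...and regmatches() does the same in one step.
by_match <- regmatches(string, gregexpr(regexp, string))[[1]]
```

Both by_position and by_match contain "23 mai 2000" and "18 mai 2004".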

Extracting a substring from a string

Extracting a fixed width substring

  • substr() extracts the substring between a start and a stop position.
  • str_sub() (stringr) is similar.
> substr("simple text",1,3)
[1] "sim"
> str_sub("simple text",1,3)
[1] "sim"

Extracting the first word in a string

  • first.word() (Hmisc package) returns the first word of a string or expression.
> first.word("abc def ghk")
[1] "abc"

Extracting a pattern in a string

  • grep() returns the elements of x that match the regular expression if value = TRUE and their positions in x if value = FALSE.
> string <- "23 mai 2000"
> string2 <- "1 mai 2000"
> grep(pattern = regexp, x = string, value = TRUE)
[1] "23 mai 2000"
> grep(pattern = regexp, x = string2, value = TRUE)
character(0)
> grep(pattern = regexp, x = string, value = FALSE)
[1] 1
> grep(pattern = regexp, x = string2, value = FALSE)
integer(0)
  • str_extract(), str_extract_all(), str_match(), str_match_all() (stringr) and m() (caroline package) are similar to grep(). str_extract() and str_extract_all() return a vector. str_match() and str_match_all() return a matrix and m() a dataframe.
> library("stringr")
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> string <- "blabla 23 mai 2000 blabla 18 mai 2004"
> str_extract(string,regexp)
[1] "23 mai 2000"
> str_extract_all(string,regexp)
[[1]]
[1] "23 mai 2000" "18 mai 2004"

> str_match(string,regexp)
     [,1]          [,2] [,3]  [,4]  
[1,] "23 mai 2000" "23" "mai" "2000"
> str_match_all(string,regexp)
[[1]]
     [,1]          [,2] [,3]  [,4]  
[1,] "23 mai 2000" "23" "mai" "2000"
[2,] "18 mai 2004" "18" "mai" "2004"
> library("caroline")
> m(pattern = regexp, vect = string, names = c("day","month","year"), types = rep("character",3))
  day month year
1  18   mai 2004
  • Named capture regular expressions can be used to define column names in the regular expression (this also serves to document the regular expression). Install the namedCapture package via devtools::install_github("tdhock/namedCapture") to use str_match_all_named(). It uses the base function gregexpr(perl=TRUE) to parse a Perl-Compatible Regular Expression, and returns a list of match matrices with column names:
> named.regexp <- paste0(
+   "(?<day>[[:digit:]]{2})",
+   " ",
+   "(?<month>[[:alpha:]]+)",
+   " ",
+   "(?<year>[[:digit:]]{4})")
> namedCapture::str_match_all_named(string, named.regexp)
[[1]]
     day  month year  
[1,] "23" "mai" "2000"
[2,] "18" "mai" "2004"

Making some substitution inside a string

Substituting a pattern in a string

  • sub() makes a substitution.
  • gsub() is similar to sub() but replaces all occurrences of the pattern, whereas sub() only replaces the first occurrence.
  • str_replace() (stringr) is similar to sub(), and str_replace_all() (stringr) is similar to gsub().

In the following example, we have a French date. The pattern is the following: 2 digits, a blank, some letters, a blank, 4 digits. We capture the 2 digits with the [[:digit:]]{2} expression, the letters with [[:alpha:]]+ and the 4 digits with [[:digit:]]{4}. Each of these three substrings is surrounded by parentheses. The first substring is stored in "\\1", the second one in "\\2" and the third one in "\\3".

string <- "23 mai 2000"
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
sub(pattern = regexp, replacement = "\\1", x = string) # returns the first part of the regular expression
sub(pattern = regexp, replacement = "\\2", x = string) # returns the second part
sub(pattern = regexp, replacement = "\\3", x = string) # returns the third part
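
Backreferences can also be combined in the replacement. The following sketch reorders the captured groups; long_string is an illustrative name:

```r
string <- "23 mai 2000"
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"

# Reorder the three captured groups as year-month-day.
reordered <- sub(pattern = regexp, replacement = "\\3-\\2-\\1", x = string)
reordered                                  # "2000-mai-23"

# With gsub() the same kind of rewrite is applied to every date in a longer text.
long_string <- "blabla 23 mai 2000 blabla 18 mai 2004"
no_day <- gsub(pattern = regexp, replacement = "\\2 \\3", x = long_string)
no_day                                     # "blabla mai 2000 blabla mai 2004"
```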

In the following example, we compare the outcome of sub() and gsub(). The first one removes the first space whereas the second one removes all spaces in the text.

> text <- "abc def ghk"
> sub(pattern = " ", replacement = "",  x = text)
[1] "abcdef ghk"
> gsub(pattern = " ", replacement = "",  x = text)
[1] "abcdefghk"

Substituting characters in a string

  • chartr() substitutes characters in a string. It stands for "character translation".
  • replacechar() (cwhmisc) does the same job,
  • as well as str_replace_all() (stringr).
> chartr(old="a",new="o",x="baba")
[1] "bobo"
> chartr(old="ab",new="ot",x="baba")
[1] "toto"
> replacechar("abc.def.ghi.jkl",".","_")
[1] "abc_def_ghi_jkl"
> str_replace_all("abc.def.ghi.jkl","\\.","_")
[1] "abc_def_ghi_jkl"

Converting letters to lower or upper-case

  • tolower() converts upper-case characters to lower-case.
  • toupper() converts lower-case characters to upper-case.
  • capitalize() (Hmisc) capitalizes the first letter of a string.
  • See also cap(), capitalize(), lower(), lowerize() and CapLeading() in the cwhmisc package.
> tolower("ABCdef")
[1] "abcdef"
> toupper("ABCdef")
[1] "ABCDEF"
> capitalize("abcdef")
[1] "Abcdef"

Filling a string with some character

  • padding() (cwhmisc) fills a string with some characters to fit a given length. See also str_pad() (stringr).
> library("cwhmisc")
> padding("abc",10," ","center") # adds blanks such that the length of the string is 10.
[1] "   abc    "
> str_pad("abc",width=10,side="center", pad = "+")
[1] "+++abc++++"
> str_pad(c("1","11","111","1111"),3,side="left",pad="0") 
[1] "001"  "011"  "111"  "1111"

Note that str_pad() was very slow in the version of stringr used for the timings below: padding a vector of length 10,000 took a very long time. These timings may no longer hold with recent versions of the package. padding() does not seem to handle character vectors, so one solution is to use the sapply() and padding() functions together.

> library("stringr")
> library("cwhmisc")
> a <- rep(1,10^4)
> system.time(b <- str_pad(a,3,side="left",pad="0"))
   user  system elapsed 
 50.968   0.208  73.322 
> system.time(c <- sapply(a, padding, space = 3, with = "0", to = "left"))
   user  system elapsed 
  7.700   0.020  12.206
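
Padding can also be done without extra packages. This is a sketch under the assumption that R 3.3.0 or later is available (for strrep()); pad_left() is a hypothetical helper written for this example, not a function from any package:

```r
x <- c("1", "11", "111", "1111")

# pad_left() is a hypothetical helper, not part of any package.
pad_left <- function(s, width, pad = "0") {
  n_missing <- pmax(0, width - nchar(s))    # characters still needed
  paste0(strrep(pad, n_missing), s)         # strrep() is vectorised
}
pad_left(x, 3)                              # "001" "011" "111" "1111"

# For integers, sprintf() pads with zeros directly.
sprintf("%03d", c(1L, 11L, 111L, 1111L))    # "001" "011" "111" "1111"

# formatC() pads character strings with blanks.
formatC("abc", width = 10)                  # right-justified in 10 characters
formatC("abc", width = 10, flag = "-")      # left-justified
```

Because every step is vectorised, no sapply() loop is needed.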

Removing leading and trailing spaces

  • trimws() trims leading and trailing white space. It has been part of base R since version 3.2.0; the memisc package provides a function of the same name.
  • trim() (gdata package) does the same job.
  • See also str_trim() (stringr).
> library("memisc")
> trimws("  abc def   ")
[1] "abc def"
> library("gdata")
> trim(" abc def ")
[1] "abc def"
> str_trim("  abd def  ")
[1] "abd def"

Comparing two strings

Assessing if they are identical

  • == returns TRUE if both strings are the same and FALSE otherwise.
> "abc"=="abc"
[1] TRUE
> "abc"=="abd"
[1] FALSE

Computing distance between strings

Several functions implement the Levenshtein distance between two strings:

  • adist() in the utils package (distributed with base R)
  • stringMatch() in MiscPsycho
  • stringdist() in stringdist
  • levenshteinDist() in RecordLinkage

A benchmark comparing the speed of levenshteinDist() and stringdist() is available here: [2].

Example with utils

> adist("test","tester")
     [,1]
[1,]    2
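
adist() is vectorised and returns a matrix of pairwise distances. A short sketch with illustrative strings:

```r
# One row per element of the first argument, one column per element of the second.
d <- adist(c("kitten", "test"), c("sitting", "tester"))
dim(d)                          # 2 2
d[1, 1]                         # 3: kitten -> sitting
d[2, 2]                         # 2: test -> tester

# Case can be ignored, and drop() turns a 1 x 1 result into a plain number.
drop(adist("LAZY", "lazy"))                       # 4
drop(adist("LAZY", "lazy", ignore.case = TRUE))   # 0
```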

Example with MiscPsycho

stringMatch() (MiscPsycho) computes the Levenshtein distance between two strings. If normalize="YES", the distance is divided by the length of the longer string.

> library("MiscPsycho")
> stringMatch("test","tester",normalize="NO",penalty=1,case.sensitive = TRUE)
[1] 2

Approximate matching

agrep() searches for approximate matches using the Levenshtein distance.

  • If 'value = TRUE', it returns the matching strings.
  • If 'value = FALSE', it returns the positions of the matching strings.
  • max (short for max.distance) sets the maximal Levenshtein distance allowed for a match.
>  agrep(pattern = "laysy", x = c("1 lazy", "1", "1 LAZY"), max = 2, value = TRUE)
[1] "1 lazy"
>  agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 3, value = TRUE)
[1] "1 lazy"

Miscellaneous

  • deparse() : Turn unevaluated expressions into character strings.
  • char.expand() (base) expands a string with respect to a target.
  • pmatch() (base) and charmatch() (base) seek matches for the elements of their first argument among those of their second.
> pmatch(c("a","b","c","d"),table = c("b","c"), nomatch = 0)
[1] 0 1 2 0
  • make.unique() makes a character string unique. This is useful if you want to use a string as an identifier in your data.
> make.unique(c("a", "a", "a"))
[1] "a"   "a.1" "a.2"

References

  1. Hadley Wickham "stringr: modern, consistent string processing" The R Journal, December 2010, Vol 2/2, http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf
  2. http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
  3. Former versions of R (< 2.10) also supported basic regular expressions:
    • extended regular expressions, used by extended = TRUE (the default),
    • basic regular expressions, as used by extended = FALSE (obsolete in R 2.10).
    Since basic regular expressions (‘extended = FALSE’) are now obsolete, the extended option is obsolete in version 2.11.


Times and Dates

R contains a set of object types for holding date and time information. The system time and date can also be requested.

Format

Many time and date units are recognised. These include:

Unit                  Symbol   Example
4 digit year          %Y       1932
2 digit year          %y       84
Numerical month       %m       03
Full month            %B       January
Abbreviated month     %b       Jan
Day of the month      %d       31
Full weekday          %A       Wednesday
Abbreviated weekday   %a       Wed
Hours (24hr clock)    %H       16
Minutes               %M       35
Seconds               %S       52


The default format is yyyy-mm-dd hh:mm:ss or %Y-%m-%d %H:%M:%S

For example 2010-02-13 23:12:24


System Date and Time

To get the system date and time:

> Sys.time()
[1] "2010-02-13 23:12:24 COT"
> format(Sys.time(),"%H %M")   # in a different format and without the date
[1] "23 13"
> Sys.Date()
[1] "2010-02-13"
> date()                       # returns the current date and time as a character string
[1] "Wed Jul 18 10:59:42 2012"

Convert strings to date/time objects

Convert a string representing the date or time into a Date/Time object:

> my.date <- as.Date("2010-12-30")
> print(my.date)
 [1] "2010-12-30"
> my.date2 <- as.Date("12/20/30", format="%m/%d/%y") # input date in a different format
> print(my.date2)
 [1] "2030-12-20"
> my.time <- strptime("12/20/30 14.34.35", format="%m/%d/%y %H.%M.%S") # input time and date
> print(my.time)
 [1] "2030-12-20 14:34:35"
> my.string <- as.character(Sys.time()) # convert a date/time object to a normal string
> print(my.string)
 [1] "2016-06-30 23:04:44"
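
The format symbols can be used both for parsing and for printing, and Date objects support arithmetic. A minimal sketch with arbitrary dates, using numeric symbols only so that the results do not depend on the locale:

```r
# Parse with an explicit format, then print in another one.
d1 <- as.Date("13/02/2010", format = "%d/%m/%Y")
format(d1, "%Y-%m-%d")            # "2010-02-13"
format(d1, "%d.%m.%y")            # "13.02.10"

# Dates support arithmetic: differences are in days.
d2 <- as.Date("2010-12-30")
as.numeric(d2 - d1)               # 320
d1 + 7                            # one week later: "2010-02-20"

# A string that does not fit the format yields NA rather than an error.
is.na(as.Date("2010-02-13", format = "%d/%m/%Y"))   # TRUE
```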

Extracting information from dates

Get the weekday, the month and an integer representing the number of days since the beginning of the epoch:

> my.date <- as.Date("2010-12-20")
> weekdays(my.date) # Get a string representing the weekday of the specified date
[1] "Monday"
> months(my.date) # Get the month as well
[1] "December"
> julian(my.date) # Get the integer number of days since the beginning of epoch
[1] 14963
attr(,"origin")
[1] "1970-01-01"

Note that weekdays() and months() return results in the language of the current locale. For instance, if you set the locale to French, you get weekdays and months in French[1]:

> require("lubridate")
> Sys.setlocale(locale="fr_FR.UTF-8")
[1] "fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8"
> mydate  <- ymd("2002-04-21")
> weekdays(mydate)
[1] "Dimanche"
> months(mydate)
[1] "avril"

Generating sequences of dates

> seq(from = as.Date("01/01/12", "%d/%m/%y"), to = as.Date("10/01/12","%d/%m/%y"), by = "day")
# create the first 10 days of January 2012
 [1] "2012-01-01" "2012-01-02" "2012-01-03" "2012-01-04" "2012-01-05" "2012-01-06"
 [7] "2012-01-07" "2012-01-08" "2012-01-09" "2012-01-10"

> seq(from = as.Date("20/01/12", "%d/%m/%y"), to = as.Date("20/12/12","%d/%m/%y"), by = "month")
# create the 20th of each month in 2012
 [1] "2012-01-20" "2012-02-20" "2012-03-20" "2012-04-20" "2012-05-20" "2012-06-20"
 [7] "2012-07-20" "2012-08-20" "2012-09-20" "2012-10-20" "2012-11-20" "2012-12-20"

> seq(from = as.Date("01/01/12", "%d/%m/%y"), to = as.Date("31/01/12","%d/%m/%y"), length.out = 16)
# create a sequence of every other day in January 2012
 [1] "2012-01-01" "2012-01-03" "2012-01-05" "2012-01-07" "2012-01-09" "2012-01-11"
 [7] "2012-01-13" "2012-01-15" "2012-01-17" "2012-01-19" "2012-01-21" "2012-01-23"
[13] "2012-01-25" "2012-01-27" "2012-01-29" "2012-01-31"
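
The by argument also accepts a multiple of a unit, which states the step directly instead of deriving it from length.out. A short sketch (start and every_other are illustrative names):

```r
start <- as.Date("2012-01-01")

# Every other day in January 2012, stated directly as a step of two days.
every_other <- seq(from = start, to = as.Date("2012-01-31"), by = "2 days")
length(every_other)                       # 16

# Four consecutive Sundays (1 January 2012 was a Sunday).
sundays <- seq(from = start, by = "week", length.out = 4)

# Stepping backwards works with a negative step.
earlier <- seq(from = start, by = "-1 month", length.out = 3)
```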


Graphics

R includes at least three graphical systems: the standard graphics package, the lattice package for Trellis graphs[1] and the grammar-of-graphics ggplot2 package[2]. R has good graphical capabilities, but there are alternatives such as gnuplot.


Interactive Graphics

This section discusses some ways to draw graphics without writing R scripts.

The playwith package provides a graphical user interface to customize graphs (add a title, a grid, some text, etc.) and exports the R code needed to replicate the result[3]. To know more, have a look at the screenshots on the package website. See also the example on "R you Ready" [3]. This package requires the GTK+ libraries.

library("playwith")
playwith(plot(x1))

There is also a graphical user interface GrapheR which makes it very easy to draw graphs for beginners[4]. This solution is cross-platform.

> library(GrapheR)

latticist is another similar project.

Note also that some graphical user interfaces such as RKWard and R Commander make it easy to draw graphs.

Standard R graphs

In this section we present what you need to know if you want to customize your graphs in the default graph system.

  • plot() is the main function for graphics. The arguments can be a single point such as 0 or c(.3,.7), a single vector, a pair of vectors or many other R objects.
  • par() is another important function which defines the default settings for plots.
  • There are many other plot functions which are specific to some tasks such as hist(), boxplot(), etc. Most of them take the same arguments as the plot() function.
> N <- 10^2
> x1 <- rnorm(N) 
> x2 <- 1 + x1 + rnorm(N)
> plot(0) 
> plot(0,1) 
> plot(x1) 
> plot(x1,x2) # scatter plot x1 on the horizontal axis and x2 on the vertical axis
> plot(x2 ~ x1) # the same but using a formula (x2 as a function of x1)
> methods(plot) # show all the available methods for plot (depending on the number of loaded packages).

Titles, legends and annotations

Titles

main gives the main title and sub the subtitle. Both can be passed as arguments to plot() or added afterwards with the title() function. xlab gives the label of the x axis and ylab the label of the y axis.

 plot(x1,x2, main = "Main title", sub = "sub title" , ylab = "Y axis", xlab = "X axis")
 plot(x1,x2 ,  ylab = "Y axis", xlab = "X axis")
 title(main = "Main title", sub = "sub title" )

The size of the text can be modified using the parameters cex.main, cex.lab, cex.sub and cex.axis. Each of them is a scaling factor, i.e. the size of the text is multiplied by the value of the parameter. With cex.main=2 the main title is twice as big as usual.
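
As an illustration of these scaling factors, here is a minimal sketch with simulated data; the values 2, 1.2 and 0.8 are arbitrary choices:

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- 1 + x1 + rnorm(100)

plot(x1, x2,
     main = "Main title", sub = "sub title",
     xlab = "X axis", ylab = "Y axis",
     cex.main = 2,     # main title twice the default size
     cex.lab = 1.2,    # axis labels 20% larger
     cex.axis = 0.8)   # tick labels 20% smaller
```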

Legend

legend() adds a legend to the current plot. The position can be given as a keyword such as "bottomleft", "bottomright", "topleft" or "topright", or as exact coordinates.

plot(x1, type = "l", col = 1, lty = 1) 
lines(x2, col = 2, lty = 2) 
legend("bottomleft", legend = c("x1","x2"), col = 1:2, lty = 1:2)

Text in the margin

mtext() puts some text in the margin. The margin can be at the bottom (1), on the left (2), at the top (3) or on the right (4).

plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 1) # the bottom
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 2) # the left
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 3) # the top
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 4) # the right margin

Text in the graph

text() adds text at given coordinates inside the plot region.
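
A minimal sketch of text() with simulated data; the labels and coordinates are chosen only for illustration:

```r
set.seed(1)
x1 <- rnorm(100)
plot(x1)

# Label the largest observation; pos = 1 puts the text below the point.
i <- which.max(x1)
text(x = i, y = x1[i], labels = "maximum", pos = 1, col = "red")

# Free text at arbitrary coordinates, here centred at the bottom of the plot.
text(x = 50, y = min(x1), labels = "some text", adj = c(0.5, 0))
```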

Mathematical annotations

We can add mathematical symbols using expression() and substitute values into a formula using substitute().

?plotmath # gives help for mathematical annotations
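
A small sketch of both functions, assuming simulated data; the formulas themselves are arbitrary examples of plotmath syntax:

```r
set.seed(1)
x1 <- rnorm(100)
m <- round(mean(x1), 2)

plot(x1,
     main = expression(alpha + beta^2),             # a fixed formula
     xlab = expression(x[i]),                       # a subscript
     ylab = substitute(bar(x) == v, list(v = m)))   # the value of m is inserted
```

substitute() is needed for the y label because expression() would print the name v instead of its value.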

Types

The type of a plot can be:

  • n for none (nothing is plotted),
  • p for points,
  • l for lines,
  • b for both points and lines,
  • o for both, overplotted,
  • h for histogram-like vertical lines,
  • and s/S for steps.
x1 <- rnorm(50) 
png("plottype.png")
par(mfrow = c(2,2))
plot(x1, type = "p", main = "points", ylab = "", xlab = "")
plot(x1, type = "l", main = "lines", ylab = "", xlab = "")
plot(x1, type = "b", main = "both", ylab = "", xlab = "")
plot(x1, type = "o", main = "both overplot", ylab = "", xlab = "")
dev.off()

Axes

By default the axes are drawn. We can remove them with axes=FALSE and draw customized axes with the axis() function.

> plot(x1,x2,axes=FALSE)

> plot(x1,x2,axes=FALSE)
> axis(1,col="red",col.axis="blue",font.axis=3)
> axis(2,col="red",col.axis="blue",font.axis=2,las=2)

las specifies the style of axis labels. It can be 0, 1, 2 or 3.

  • 0 : always parallel to the axis [default],
  • 1 : always horizontal,
  • 2 : always perpendicular to the axis,
  • 3 : always vertical.
x1 <- rnorm(100)
par(mfrow = c(2,2))
plot(x1, las = 0, main = "las = 0", sub = "always parallel to the axis", xlab = "", ylab = "")
plot(x1, las = 1, main = "las = 1", sub = "always horizontal", xlab = "", ylab = "") 
plot(x1, las = 2, main = "las = 2", sub = "always perpendicular to the axis", xlab = "", ylab = "")
plot(x1, las = 3, main = "las = 3", sub = "always vertical", xlab = "", ylab = "")

It is also possible to add another y axis on the right with axis(4).
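
A right-hand axis is mostly useful for a second series on a different scale. The following is one possible sketch, assuming simulated data and the par(new = TRUE) idiom to overlay a second plot:

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- 100 * cumsum(rnorm(100))          # a second series on a different scale

old <- par(mar = c(5, 4, 4, 4) + 0.1)   # leave room for the right-hand axis
plot(x1, type = "l", ylab = "x1")
par(new = TRUE)                         # draw the next plot over the current one
plot(x2, type = "l", col = "red", axes = FALSE, xlab = "", ylab = "")
axis(4, col = "red", col.axis = "red")  # axis on the right, scaled for x2
mtext("x2", side = 4, line = 3, col = "red")
par(old)                                # restore the previous settings
```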

Margins

Margins can be specified in inches or in lines of text. The default is par(mar = c(5, 4, 4, 2) + 0.1), which means that there are about 5 lines at the bottom, 4 lines on the left, 4 lines at the top and 2 lines on the right. This can be modified using the par() function. To specify margins in inches, use par(mai = c(bottom, left, top, right)). To specify margins in lines, use par(mar = c(bottom, left, top, right)). See ?par to learn more about the topic.
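
A minimal sketch of changing and restoring the margins; the values are arbitrary:

```r
old <- par(mar = c(4, 4, 1, 1))   # bottom, left, top, right, in lines of text
m <- par("mar")                   # 4 4 1 1
par("mai")                        # the same margins expressed in inches
plot(rnorm(100))
par(old)                          # restore the previous margins
```

par() returns the previous values of the parameters it changes, which makes it easy to restore them afterwards.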

Colors

The color of the points or lines can be changed using the col argument. fg sets the foreground color (boxes and axes) and bg the background color.

  • show.col(object=NULL) (Hmisc package) plots the main R colors with their numeric code.
  • The list of all colors in R (pdf)
colors() # list the r colors
show.col(object=NULL) # graphs the main R colors
plot(x1, col = "blue")
plot(x1, col = "red")
plot(x1, col = "red", col.axis = "dodgerblue", col.lab = "firebrick", col.main = "darkgreen", col.sub = "cyan4", main = "Testing colors", sub = "sub titles", ylab = "y axis", xlab = "x axis")
  • We can also generate new colors using the rgb() function. The first argument is the intensity of red, the second the intensity of green and the third the intensity of blue. They vary between 0 and 1 by default, but this can be modified with the option maxColorValue = 255. col2rgb() returns the RGB code of R colors. col2hex() (gplots) gives the hexadecimal code. col2grey() and col2gray() (TeachingDemos) convert colors to grey scale.
> mycolor <- rgb(.2,.4,.6)
> plot(x1, col = mycolor)
> col2rgb("pink")
      [,1]
red    255
green  192
blue   203
> library("gplots")
> col2hex("pink")
[1] "#FFC0CB"
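
A few more uses of rgb(), sketched with base R only; the alpha value and the simulated points are illustrative:

```r
rgb(255, 192, 203, maxColorValue = 255)          # "#FFC0CB", the code of "pink"
rgb(0.2, 0.4, 0.6)                               # "#336699"

# Hexadecimal code of a named colour without an extra package.
rgb(t(col2rgb("pink")), maxColorValue = 255)     # "#FFC0CB"

# A fourth value sets the opacity (alpha), handy for overlapping points.
translucent <- rgb(0.2, 0.4, 0.6, alpha = 0.3)
plot(rnorm(1000), rnorm(1000), pch = 16, col = translucent)
```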

Points

For points, the symbol can be changed using the pch option, which takes integer values between 0 and 25 or a single character. pch can also take a vector as argument. In that case the first point uses the first element of the vector as symbol, and so on.

plot(x1, type = "p", pch = 0)
plot(x1, type = "p", pch = 10)
plot(x1, type = "p", pch = 25)
plot(x1, type = "p", pch = "a")
plot(x1, type = "p", pch = "*")
plot(x1[1:26], type = "p", pch = 0:25)
plot(x1[1:26], type = "p", pch = letters)

The following code displays the symbols 1 to 25 on the same plot:

x <- rep(1,25)
plot(x, pch = 1:25, axes = FALSE, xlab = "", ylab = "")
text(1:25,.95,labels = 1:25)

points() adds points to an existing plot.

> plot(x1, pch = 0) # plot x1 
> points(x2, pch = 1, col = "red") # add x2 to the existing plot

Lines

We can change the line type with lty. The argument is a string ("blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash") or an integer (0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash). The line width can be changed with lwd. The default is lwd=1. lwd=2 means that the width is twice the normal width.

plot(x1, type = "l", lty = "blank")
plot(x1, type = "l", lty = "solid")
plot(x1, type = "l", lty = "dashed")
plot(x1, type = "l", lty = "dotted")
plot(x1, type = "l", lty = "dotdash")
plot(x1, type = "l", lty = "longdash")
plot(x1, type = "l", lty = "twodash")

lines() adds lines to an existing plot.

plot(x1, type = "l", lty = "solid")
lines(x2, type = "l", lty = "dashed", col = "red")

abline() adds a horizontal line (h=), a vertical line (v=) or a straight line defined by its intercept (a=) and slope (b=) to the current plot. abline() can also draw a regression line when it is given a fitted model such as the result of lm().

> plot(x1, type = "l", lty = "solid")
> abline(h= -3, lty = "dashed", col = "gray")
> abline(v = 0, lty = "dashed", col = "gray")
> abline(a = -3 , b = .06, lty = "dotted", col = "red")
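
A minimal sketch of the regression-line case, assuming simulated data with a true intercept and slope of 1:

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- 1 + x1 + rnorm(100)

fit <- lm(x2 ~ x1)                    # least-squares fit of x2 on x1
plot(x1, x2)
abline(fit, col = "red", lwd = 2)     # abline() reads intercept and slope from fit
abline(a = 1, b = 1, lty = "dashed")  # the line used to simulate the data
```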

Boxes

Each graph is framed by a box. bty specifies the box type: with "o", "l", "7", "c", "u" or "]" the box resembles the corresponding character, and "n" suppresses the box.

plot(x1, bty = "o") # the default
plot(x1, bty = "n") # no box
plot(x1, bty = "l")
plot(x1, bty = "7")
plot(x1, bty = "u")
plot(x1, bty = "c")
plot(x1, bty = "]")

See also box() to add a box to an existing plot.

Grid

grid() adds a grid to the current graph.

> plot(x1)
> grid()

Although grid has an optional argument nx for setting the number of grid lines, it is not possible to tell it explicitly where to place those lines (it will usually not place them at integer values). A more precise and manageable alternative is to use abline().

> abline(v=(seq(0,100,5)), col="lightgray", lty="dotted")
> abline(h=(seq(0,100,5)), col="lightgray", lty="dotted")

Arrows and segments
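
A minimal sketch of the base functions segments() and arrows(), which add straight lines and arrows to an existing plot. The coordinates are arbitrary values chosen for illustration:

```r
plot(0:10, 0:10, type = "n", xlab = "", ylab = "")   # empty plot region

# segments(x0, y0, x1, y1) draws a straight line between two points.
segments(x0 = 1, y0 = 1, x1 = 9, y1 = 1, col = "blue", lwd = 2)

# arrows() takes the same coordinates and adds an arrow head at the end.
# length is the size of the head in inches; code = 3 draws heads at both ends.
arrows(x0 = 1, y0 = 3, x1 = 9, y1 = 8, length = 0.1, col = "red")
arrows(x0 = 1, y0 = 9, x1 = 9, y1 = 9, length = 0.1, code = 3)
```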

Polygons
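
A minimal sketch of the base function polygon(), which draws a closed shape through a set of vertices. The shapes below are arbitrary examples:

```r
plot(0:10, 0:10, type = "n", xlab = "", ylab = "")   # empty plot region

# polygon() joins the vertices in order and closes the shape.
polygon(x = c(2, 8, 5), y = c(5, 5, 9), col = "lightblue", border = "blue")

# Shading the area under a curve: walk along the curve, then back along y = 0.
xs <- seq(0, 10, length.out = 50)
ys <- 2 + sin(xs)
polygon(x = c(xs, rev(xs)), y = c(ys, rep(0, length(xs))),
        col = "grey90", border = NA)
lines(xs, ys)
```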

Other figures

We can also add a circle to a plot with the circle() function in the calibrate package.

Background

You can choose the background of your plot. For instance, you can change the background color with par(bg=).

par(bg="whitesmoke")
par(bg="transparent")

Overlaying plots

matplot() plots the columns of one matrix against the columns of another, so several series can be drawn on the same plot at once.

N <- 100
x1 <- rnorm(N)
x2 <- rnorm(N) + x1 + 1
y <- 1 + x1 + x2 + rnorm(N)
mydat <- data.frame(y,x1,x2)
matplot(mydat[,1],mydat[,2:3], pch = 1:2)

Multiple plots

With par() we can display multiple figures on the same page. mfrow = c(3,2) arranges 6 figures in 3 rows and 2 columns and fills the grid row by row. mfcol = c(3,2) uses the same layout but fills it column by column.

par(mfrow = c(3,2))
plot(x1, type = "n")
plot(x1, type = "p")
plot(x1, type = "l")
plot(x1, type = "h")
plot(x1, type = "s")
plot(x1, type = "S")

par(mfcol = c(3,2))
plot(x1, type = "n")
plot(x1, type = "p")
plot(x1, type = "l")
plot(x1, type = "h")
plot(x1, type = "s")
plot(x1, type = "S")


Plotting a function

  • curve() plots a function. The curve can be added to an existing plot with the option add = TRUE.
  • plot() can also plot functions.
curve(x^2, from = -1 , to = 1, main = "Quadratic function", ylab = "f(x)=x^2")

plot(rnorm(100))
curve((x/100)^2, add = TRUE, col = "red")

Exporting graphs

How can you export a graph?

  • First you can plot the graph and use the context menu (right click on Windows and Linux or control + click on Mac) to copy or save the graph. The available options depend on your operating system. On Windows, you can also copy the current graph to the clipboard as a bitmap (raster graphics) using CTRL + C or as a Windows Metafile (vector graphics) using CTRL + W. You can then paste it into another application.
  • You can export a plot to pdf, png, jpeg, bmp or tiff by calling pdf("filename.pdf"), png("filename.png"), jpeg("filename.jpg"), bmp("filename.bmp") or tiff("filename.tiff") before the plotting commands, and dev.off() after them.
  • You can also use the savePlot() function to save existing graphs.
  • Sweave also produces ps and pdf graphics (see the Sweave section).

It is better to use vector devices such as pdf, ps or svg, since vector graphics can be rescaled without loss of quality.

How can you know the list of all available devices?

  • ?Devices
  • Use the capabilities() function to see the list of available devices on your computer.
?Devices
> capabilities()
    jpeg      png     tiff    tcltk      X11     aqua http/ftp  sockets 
    TRUE     TRUE     TRUE     TRUE    FALSE    FALSE     TRUE     TRUE 
  libxml     fifo   cledit    iconv      NLS  profmem    cairo 
    TRUE    FALSE     TRUE     TRUE     TRUE     TRUE    FALSE
png("r_plot.png", width = 420, height = 340)
plot(x1, main = " Example")
dev.off()

pdf("r_plot.pdf", width = 7, height = 5.7) # pdf() sizes are in inches, not pixels
plot(x1, main = " Example")
dev.off()

postscript(file="graph1.ps", horizontal=FALSE, pagecentre=FALSE, paper="special", width=8.33, height=5.56)
plot(x1, main = "Example")
dev.off()

plot(x1, main = "Example")
savePlot("W:/Bureau/plot.pdf", type = "pdf")
savePlot("W:/Bureau/plot.png", type = "png")
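
The open, plot, close pattern can be checked end to end by writing to a temporary file. This is a minimal sketch; the file name comes from tempfile() and the data are simulated, so nothing here depends on the earlier examples.

```r
# Write a plot to a temporary PDF file and check that the file was created.
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, width = 7, height = 5)   # size in inches
plot(rnorm(100), main = "Example")
dev.off()                              # closing the device writes the file

file.exists(out_file)                  # TRUE once the device is closed
file.info(out_file)$size               # size of the file in bytes
```

Forgetting dev.off() is a common cause of empty or unreadable output files.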

We can also export to SVG using the svg() function.

svg("scatterplot.svg", width = 7, height = 7)
plot(x1, y)
dev.off()

The RSvgDevice package, which was used with earlier versions of R, now seems outdated.

Advanced topics

Animated plots

The animation package provides dynamic graphics capabilities. It is possible to export the animation in Flash, MPEG or GIF format. There are more examples on the aniwiki website: http://animation.yihui.name/.

You can also create motion charts using the googleVis package[5].

Interactive Graphics

The iplots package provides a way to have interactive data visualization in R[6][7].

To create an interactive, animated plot viewable in a web browser, the animint package can be used. The main idea is to define an interactive animation as a list of ggplots with two new aesthetics:

  • showSelected=variable means that only the subset of the data that corresponds to the selected value of variable will be shown.
  • clickSelects=variable means that clicking a plot element will change the currently selected value of variable.

Graphics gallery

In this section, we review all kinds of statistical plots and the alternative ways to draw them in R. This includes code for the standard graphics package, the lattice package and the ggplot2 package. We also add some examples from the Commons repository, limited to examples which are provided with their R code. On the wiki, you can click on any graph to find its R code.

Line plot

To draw a line plot, use the generic plot() function with type="l".

> x <- seq(0, 2*pi, pi/10)
> plot(x, sin(x), type="l")

Then, you can add further lines on the same plot using the lines() function.

> lines(x, cos(x))

Scatter plot

  • plot(x,y)
  • plot(y ~ x)
  • xyplot(y ~ x) (lattice)
  • qplot(x,y) (ggplot2)

Log scale

Sometimes it is useful to plot a variable on a log scale. This can be done with the log option of the plot() function, which changes the scale of the axes rather than the data.

  • For a log-log plot, use log = "xy"
  • For a log scale on the x axis only, use log = "x"
  • For a log scale on the y axis only, use log = "y"
plot(x, y , log = "xy")
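
The effect of the log option is easiest to see with data whose shape is known in advance. The following sketch uses made-up data: an exponential curve and a power law.

```r
# Exponential growth looks curved on a linear scale
# and straight when the y axis is logarithmic.
x <- 1:50
y <- exp(0.2 * x)

plot(x, y, main = "Linear scale")
plot(x, y, log = "y", main = "Log scale on the y axis")

# A power law is a straight line on a log-log plot.
plot(x, x^3, log = "xy", main = "Log-log scale")
```

Note that all values plotted on a log axis must be strictly positive.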

Label points in a plot

  • It is possible to add labels with the text() function.
  • textxy() (calibrate) makes it easy to add labels.
library(calibrate) # provides textxy()
N <- 10
u <- rnorm(N)
x <- 1 + rnorm(N)
y <- 1 + x + u
plot(x, y)
textxy(x, y, labs = signif(x,3), cex = 0.7) # older versions of calibrate call this argument cx
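
Labels can also be added with text() from the base graphics package, which needs no extra package. In this sketch the points are simply labelled with their index; the data are simulated.

```r
set.seed(1)
x <- rnorm(10)
y <- 1 + x + rnorm(10)
plot(x, y)
# pos = 3 places each label just above its point
text(x, y, labels = seq_along(x), pos = 3, cex = 0.7)
```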

Histogram

  • hist()
  • histogram() (lattice)

You can learn more about histograms in the Non parametric methods page.
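
As a brief illustration with simulated data, hist() both draws the histogram and returns an object describing the bins. The value breaks = 20 is only a suggestion to R, which may choose a nearby number of bins.

```r
set.seed(123)
x <- rnorm(500)
h <- hist(x, breaks = 20, main = "Histogram of x")

h$breaks                      # bin boundaries
h$counts                      # number of observations in each bin
sum(h$counts) == length(x)    # every observation falls in exactly one bin
```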

Box plot

  • boxplot()
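
A small sketch with simulated data shows what boxplot() draws and returns. The value 6 is added by hand to act as an outlier.

```r
set.seed(42)
x <- c(rnorm(100), 6)      # one artificial outlier at 6
b <- boxplot(x, main = "Box plot")

b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points drawn individually as outliers
```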

Bar charts

See Bar charts on Wikipedia.

  • barplot() takes a table as argument and returns a bar chart.
  • qplot() (ggplot2) with the option geom = "bar" takes a variable as argument and returns a bar chart[8].
  • barchart() (lattice) takes a variable as argument and returns a bar chart.
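
For instance, with the base graphics package, the variable is first tabulated and the table is then passed to barplot(). The data below are made up for the illustration.

```r
x <- c("A", "B", "A", "C", "A", "B")
tab <- table(x)     # frequency of each category
tab
barplot(tab, main = "Bar chart", ylab = "Frequency")
```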

Dot plot

See also Dot plot on Wikipedia.

  • dotchart()

Pie charts

  • pie()

Treemap

The treemap() function in the treemap package (called tmPlot() in early versions of the package) makes it easy to draw a treemap.

Confidence interval plot

Standard error bar charts are very useful to plot several estimates with their confidence intervals.

  • The Hmisc package has an errbar() function. This function takes the upper and lower bounds of the confidence intervals as arguments[9].
  • The coefplot() function in Gelman and Hill's arm package is designed to display estimation results. It takes point estimates and standard errors as arguments.
coefs <- c(0.2, 1.4, 2.3, 0.5, 0.3) # vector of point estimates
se <- c(0.12, 0.24, 0.23, 0.15, 0.2) # standard errors of point estimates
variable <- 1:5 # variable names
library("arm")
# we use CI = qnorm(.975) to have 95% confidence interval
coefplot(coefs, se, varnames = variable, vertical = TRUE, CI = qnorm(.975))
coefplot(coefs, se, varnames = variable, vertical = FALSE, CI = qnorm(.975))
library("Hmisc")
errbar(variable, coefs, coefs - qnorm(.975) * se, coefs + qnorm(.975) * se)

See also

  • There is another errbar() function in the sfsmisc package.
  • plotCI() (gplots) also plots error bars.
  • plotmeans() (gplots)
  • ciplot() (hacks)
  • See also Error bar on Wikipedia

3D plots

  • contour(), image(), persp()
  • plot3d() (rgl)
  • wireframe() (lattice)

Diagrams

  • grid package by Paul Murrell[10]
  • diagram package [11]
  • Rgraphviz package
  • igraph package

Arc Diagrams

It is also possible to draw Arc Diagrams[12].

Dendrograms

It is possible to plot dendrograms in R[13].

Treemap

It is possible to draw a treemap using the treemap() function in the treemap package[14].

Wordcloud

Two functions draw word clouds:

  • the wordcloud() function in the wordcloud package
  • the tagcloud() function in the tagcloud package

Timeline

  • timeline() in the timeline package

Maps

References

  1. D. Sarkar. Lattice: Multivariate Data Visualization with R. Springer, 2008. ISBN 9780387759685.
  2. ggplot2: Elegant Graphics for Data Analysis (Use R) by Hadley Wickham and a list of examples on his own website : http://had.co.nz/ggplot2/
  3. playwith : http://code.google.com/p/playwith/
  4. Hervé, Maxime (2011). "GrapheR: a Multiplatform GUI for Drawing Customizable Graphs in R" (PDF). The R Journal. 3 (2).
  5. Tutorial for the googleVis package : http://stackoverflow.com/questions/4646779/embedding-googlevis-charts-into-a-web-site/4649753#4649753
  6. http://www.r-bloggers.com/interactive-graphics-with-the-iplots-package-from-%E2%80%9Cr-in-action%E2%80%9D/
  7. http://www.r-statistics.com/2012/01/interactive-graphics-with-the-iplots-package-from-r-in-action/ Interactive Graphics with the iplots Package, a chapter from the R in Action book
  8. Hadley Wickham ggplot2: Elegant Graphics for Data Analysis, Springer Verlag, 2009
  9. The default output in errbar() changed between R version 2.8.1 and R version 2.9.2. Axis are not displayed by default anymore
  10. Paul Murrell Drawing Diagrams with R, The R Journal, 2009 http://journal.r-project.org/2009-1/RJournal_2009-1_Murrell.pdf
  11. (example: Using a binary tree diagram for describing a Bernoulli process)
  12. Gaston Sanchez (February 3, 2013). "Arc Diagrams in R: Les Miserables". Retrieved February 5, 2013.
  13. Gaston Sanchez (October 3, 2012). "7+ ways to plot dendrograms in R". Retrieved February 5, 2013.
  14. http://cran.r-project.org/web/packages/treemap/treemap.pdf
  15. http://www.stat.auckland.ac.nz/~paul/RGraphics/rgraphics.html
  16. http://had.co.nz/ggplot2/


Grammar of graphics

Hadley Wickham has developed ggplot2, a graphics library designed according to the principles of the Grammar of Graphics.

Plotting a function

We use qplot() with the option stat = "function" (recent versions of ggplot2 have deprecated the stat argument of qplot(); ggplot() with stat_function() is the replacement):

# Plot the quadratic function
square <- function(x){
  x^2
}
mode(square)
qplot(c(0, 2), stat = "function", fun = square, geom = "line")

Here is another example with the sine function:

# plot the sine function
qplot(c(-10, 10), stat = "function", fun = sin, geom = "line")

Bibliography

  • Leland Wilkinson, The Grammar of Graphics (Statistics and Computing), Springer, 2005
  • Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis, Use R!, Springer, 2009


Publication quality output

Formatting numbers

You can use the format() function to control the number of digits and other characteristics of a displayed object.

> df <- data.frame(x = rnorm(10), y = rnorm(10))
> print(df)
            x          y
1  -0.4350953 -0.6426477
2  -0.5947293 -0.2389625
3  -0.7061850 -2.4382016
4  -0.3384038 -0.6322842
5   0.2713353  0.5396409
6  -1.1144711 -2.0321274
7  -1.0356184  1.7217443
8  -2.6665278 -0.3621377
9   0.2975570  0.1598905
10  1.4631458 -0.7995652
> print(format(df, digits=3, scientific=T))
           x         y
1  -4.35e-01 -6.43e-01
2  -5.95e-01 -2.39e-01
3  -7.06e-01 -2.44e+00
4  -3.38e-01 -6.32e-01
5   2.71e-01  5.40e-01
6  -1.11e+00 -2.03e+00
7  -1.04e+00  1.72e+00
8  -2.67e+00 -3.62e-01
9   2.98e-01  1.60e-01
10  1.46e+00 -8.00e-01
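
format() also works on single numbers and vectors, and base R offers formatC() and sprintf() for finer control. The following lines are a small sketch of the most common cases.

```r
x <- c(3.14159, 0.5, 42)
format(x, nsmall = 2)                  # common width, at least two decimals
format(pi, digits = 3)                 # "3.14"
formatC(pi, digits = 2, format = "f")  # "3.14"
sprintf("%.1f%%", 12.345)              # "12.3%"
```

All of these functions return character strings, so the formatting should be applied last, after any computation.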

Sweave

Sweave[1] is a literate programming tool which integrates LaTeX and R code. The Sweave file generates a LaTeX file and an R file which can in turn be compiled. Roger Koenker[2], Meredith and Racine (2009)[3] and Charles Geyer[4] argue that Sweave favors reproducible econometric/statistical research.

There are some alternatives to Sweave for literate programming. One of them is Babel, which is included in Emacs Org mode[5]. This tool allows export to LaTeX and HTML. It is also possible to include code chunks in various programming languages (R, Ruby, etc.).

Syntax

The main idea is that you write a file which includes LaTeX and R code. LaTeX code begins with @ and R code with <<>>= (some options can be included between << and >>).

@
% Some LaTeX code
\section{Results}
I show that ...
<<>>=
# Some R code
qnorm(.975)
@
% Some LaTeX code
$$
\Phi^{-1}(.975) = 1.96 
$$

The file is stored with extension .Rnw or .rnw. At the end, you extract from this file an R file using Stangle() and a LaTeX file using Sweave(). Here is an example with a file called file.Rnw which generates file.tex and file.R

> Sweave("file.Rnw")
Writing to file file.tex
Processing code chunks ...
 1 : echo keep.source term verbatim pdf
 2 : echo keep.source term verbatim pdf
> Stangle("file.Rnw")
Writing to file file.R

Then you can run LaTeX on your file.tex. This can be done using the system() function or texi2dvi().

# Example under Windows:
system("pdflatex.exe -shell-escape file.tex") # runs pdflatex
shell.exec("file.pdf") # opens the pdf with the default viewer (on Mac, use system("open file.pdf"))

Note that Sweave.sty is not part of the standard MiKTeX distribution. It ships with R (under the share/texmf directory of the R installation), and LaTeX must be able to find it.

You can also add your results to your text using the \Sexpr{} command.

$
\Phi^{-1}(.975) = \Sexpr{qnorm(.975)} 
$

Options

There are some options. These options can be set for each code chunk or in the Sweave command.

  • For figures, you can either include them in the tex file using fig=T or not include them using fig=F.

Figures can be exported as pdf and eps files (older versions of R produced both by default; recent versions produce only pdf unless eps=T is set). If you only want one format, suppress the other one with the pdf=F or eps=F option.

  • The R code can be displayed in the tex file using echo=T. If you don't want to include it in the tex file, use echo=F.
  • The R code can be evaluated using eval=T. If you don't want to evaluate the R code, use eval=F.
  • The results option controls how the output is included:
    • results=tex treats the output as LaTeX code
    • results=verbatim treats the output as Verbatim (the default)
    • results=hide does not include the results in the LaTeX output

These options can be passed to the Sweave() function.

Sweave("file.Rnw", pdf = T, eps=F, echo = F, results = "verbatim")

They can also be passed to each code chunk.

<<fig=T,pdf=T,eps=F>>=
plot(rnorm(100), col = "red")
@

Text editor for Sweave

The main issue with Sweave is that few text editors include syntax highlighting for Sweave. Here are some exceptions:

  • RStudio is a very good solution. It is easy to install and use, and it includes buttons to run Sweave files.
  • Vim provides syntax highlighting for Sweave files (R noweb syntax).
  • Emacs + ESS (Emacs Speaks Statistics) provides full support for Sweave files. It includes a keyboard shortcut to run Sweave files and syntax highlighting that switches between LaTeX and R.
  • The Eclipse StatET plugin provides support for Sweave (LaTeX/R) documents with all basic features (syntax highlighting, bracket matching, toggle comment, ...) and with detection of R chunks.

See also

Some examples of Sweave documents:

  • Charles Geyer foo.Rnw example
  • Julien Barnier's introduction to R (document in French)
  • trick : type filetype:Rnw or filetype:Snw in Google to get Sweave files
  • Notice that you can find lots of examples by browsing in the R library folder. The documentation is often written using Sweave and the Sweave file is often included in the package. See for instance in the np package the doc folder.

Some handouts :

  • "Literate Programming with Sweave and DOCSTRIP" (pdf) by Michael Lundholm
  • Charles Geyer 2008 "An Sweave Demo" (pdf) (short)
  • Learning To Sweave in APA Style[6]

Some packages

  • pgfSweave package
  • ascii package
  • cacheSweave
  • exam automatic generation of exams

Some alternative literate programming packages :

  • odfWeave package to Sweave with OpenOffice.
  • knitr package
  • decumar, a literate programming interface for R by Hadley Wickham[7]
  • relax package
  • wikirobot[8] is similar to Sweave but works with MediaWiki.

Pubprint

Pubprint is a small utility that transforms the output of statistical tests into publication-ready output. Pubprint can export to several formats (HTML, LaTeX, Markdown and plain text), but it only supports the APA style (the publication style of the American Psychological Association). However, this style is widely used and may be appropriate in many cases.

Example

> library("pubprint")
> pprint(t.test(rnorm(30), rnorm(30)))
[1] "(\\ensuremath{M\\ifmmode_{x}\\else\\textsubscript{x}\\fi=-0.05,M\\ifmmode_{y}\\else\\textsubscript{y}\\fi=0.09,t[57.74]=-0.49,p=.628})"

By default pubprint prints a LaTeX-formatted string, but the output format can be changed (according to the manual, pubprint is intended to be used with knitr and detects the output format automatically when it is):

> pp_opts_out$set(pp_init_out("plain"))
> pprint(t.test(rnorm(30), rnorm(30)))
[1] "(M_x=-0.14,M_y=-0.24,t[57.4]=0.41,p=.682)"
> pprint(cor.test(rnorm(30), rnorm(30)))
[1] "(r=-.08,p=.693)"

The output can be pasted into a document or included in a knitr/Sweave \Sexpr{} statement.

Export to LaTeX

R has lots of functions which allow it to export results to LaTeX[9].

General functions

toLatex() in the utils package.

  • Note that toLatex() does not handle matrices.
  • toLatex() has been adapted to handle matrices and ftables in the memisc package.
> toLatex(sessionInfo())
\begin{itemize}
  \item R version 2.2.0, 2005-10-06, \verb|powerpc-apple-darwin7.9.0|
  \item Base packages: base, datasets, grDevices,
    graphics, methods, stats, utils
\end{itemize}
  • mat2tex() (sfsmisc) exports matrix to LaTeX.
  • tex.table() (cwhmisc) package exports a dataframe into a LaTeX table.
> tex.table(mydat)
\begin{table}[ht]
\begin{center}
\begin{footnotesize}
\begin{tabular}{r|rrr}
\hline
 & y & x1 & x2\\ \hline
1 & -0.09 & -0.37 & -1.04\\ 
2 & 0.31 & 0.19 & -0.09\\ 
3 & 3.78 & 0.58 & 0.62\\ 
4 & 2.09 & 1.40 & -0.95\\ 
5 & -0.18 & -0.73 & -0.54\\ 
6 & 3.16 & 1.30 & 0.58\\ 
7 & 2.78 & 0.34 & 0.77\\ 
8 & 2.59 & 1.04 & 0.46\\ 
9 & -1.96 & 0.92 & -0.89\\ 
10 & 0.91 & 0.72 & -1.1\\ 
\hline
\end{tabular}
\end{footnotesize}
\end{center}
\end{table}


  • xtable() (xtable) exports various objects, including tables, data frames, lm, aov, and anova, to LaTeX.
> # lm example
> library(xtable)
> x <- rnorm(100)
> y <- 2*x + rnorm(100)
> lin <- lm(y~x)
> xtable(lin)
% latex table generated in R 2.15.1 by xtable 1.7-0 package
% Sun Sep 23 21:54:04 2012
\begin{table}[ht]
\begin{center}
\begin{tabular}{rrrrr}
  \hline
 & Estimate & Std. Error & t value & Pr($>$$|$t$|$) \\ 
  \hline
(Intercept) & -0.0407 & 0.0984 & -0.41 & 0.6803 \\ 
  x & 2.0466 & 0.1043 & 19.63 & 0.0000 \\ 
   \hline
\end{tabular}
\end{center}
\end{table}

> # table example
> x <- sample(1:10, 30, replace = T)
> tab <- table(x)
> tab <- cbind(tab, prop.table(tab))
> colnames(tab) <- c("N.", "Prop.")
> xtable(tab, digits = c(0, 0, 2))
% latex table generated in R 2.15.1 by xtable 1.7-0 package
% Sun Sep 23 22:06:36 2012
\begin{table}[ht]
\begin{center}
\begin{tabular}{rrr}
  \hline
 & N. & Prop. \\ 
  \hline
1 & 5 & 0.17 \\ 
  3 & 1 & 0.03 \\ 
  4 & 3 & 0.10 \\ 
  5 & 6 & 0.20 \\ 
  6 & 5 & 0.17 \\ 
  7 & 3 & 0.10 \\ 
  8 & 2 & 0.07 \\ 
  9 & 2 & 0.07 \\ 
  10 & 3 & 0.10 \\ 
   \hline
\end{tabular}
\end{center}
\end{table}

See also :

  • The highlight package by Romain François exports R code to LaTeX and HTML.
  • format.df() and latex() in the Hmisc package.
  • The memisc and the quantreg packages include other latex() functions.

Descriptive statistics

  • estout package.
  • The reporttools package includes some functions for tables of descriptive statistics[10].

Estimation results

  • The stargazer package provides an easy way to export the results of regressions to LaTeX[11].
  • texreg provides the same kind of features[12].
  • The estout package provides functions similar to Stata's esttab and estout utilities[13]. Estimates are stored using eststo() and printed using esttab(). They can be exported to CSV and LaTeX. These functions support lm, glm and plm objects (see the plm package).
  • apsrtable() (apsrtable) exports the results of multiple regressions to LaTeX in a way similar to the American Political Science Review publication standard.
  • xtable() (xtable package) exports data frames, matrices and estimation results[14]. xtable() can also be used to export the results to an HTML file.
  • The outreg() function[15] developed by Paul Johnson is similar to the Stata outreg[16] function. See the "R you ready?" post on this topic.
  • mtable() and toLatex() in the memisc package.
N <- 10^3
u <- rnorm(N)
x1 <- rnorm(N)
x2 <- x1 + rnorm(N)
y <- 1 + x1 + x2 + u
lm1 <- lm(y ~ x1 + x2 )
lm2 <- lm(y ~ x1 + x2 + I(x1*x2))

library(estout)
estclear() # clear all the eststo objects
eststo(lm1) 
eststo(lm2)
esttab() # print it

library("apsrtable")
apsrtable(lm1,lm2)

library(xtable)
xtable(lm1)
tab <- xtable(lm1)
print(tab,type="html")

source("http://pj.freefaculty.org/R/WorkingExamples/outreg-worked.R")
outreg(list(lm1,lm2))

library("memisc")
toLatex(mtable(lm1,lm2))

Export to HTML

rpublisher[17] is a literate programming tool which publishes results in HTML (it is based on Python and was last updated in 2008).


See also the R2HTML, xtable, hwriter, prettyR, highlight and HTMLUtils packages.


wiki.table() in the hacks package exports a matrix or a data frame into MediaWiki table markup (as used on this wiki and many others).

> wiki.table(matrix(1:16,4),caption="Test")
{|  
|+ Test 
| 1 || 5 || 9 || 13 
|-
| 2 || 6 || 10 || 14 
|-
| 3 || 7 || 11 || 15 
|-
| 4 || 8 || 12 || 16 
|}

References



Descriptive Statistics

In this section, we present descriptive statistics, i.e., a set of tools to describe and explore data. This mainly includes univariate and bivariate statistical tools.

Generic Functions

We introduce some functions to describe a dataset.

  • names() gives the names of each variable
  • str() gives the structure of the dataset
  • summary() gives the mean, median, min, max, 1st and 3rd quartile of each variable in the data.
> summary(mydat)
  • describe() (Hmisc package) gives more details than summary()
> library("Hmisc")
> describe(mydat)
  • contents() (Hmisc package)
  • dims() in the Zelig package.
  • descr() in the descr package gives min, max, mean and quartiles for continuous variables, frequency tables for factors and length for character vectors.
  • whatis() (YaleToolkit) gives a good description of a dataset.
  • detail() in the SciencesPo package gives a broad range of statistics for continuous variables, frequency tables for factors and length for character vectors.
  • describe() in the psych package also provides summary statistics:
> x = runif(100)
> y = rnorm(100)
> z = rt(100,1)
> sample.data = x*y*z
> require(psych)
Loading required package: psych
> describe(cbind(sample.data,x,z,y))
            var   n  mean   sd median trimmed  mad    min   max range  skew kurtosis   se
sample.data   1 100  0.37 3.21   0.00    0.07 0.31  -9.02 24.84 33.86  4.79    36.91 0.32
x             2 100  0.54 0.28   0.56    0.55 0.35   0.02  1.00  0.98 -0.12    -1.13 0.03
z             3 100  0.12 6.28   0.02   -0.01 1.14 -30.40 37.93 68.33  1.49    22.33 0.63
y             4 100 -0.01 1.07   0.09   -0.02 1.12  -2.81  2.35  5.16  0.00    -0.30 0.11
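
The base functions listed above can be tried on mtcars, one of the example data sets shipped with R. It is used here only for illustration, in place of mydat.

```r
data(mtcars)           # example data set from the datasets package
names(mtcars)          # variable names
str(mtcars)            # type and first values of each variable
summary(mtcars$mpg)    # min, quartiles, median, mean and max of one variable
dim(mtcars)            # number of rows and columns: 32 11
```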

Univariate analysis

Continuous variable

Moments

  • the mean : mean().
  • the variance : var().
  • the standard deviation : sd().
  • the skewness : skewness() (fUtilities, moments or e1071)
  • the kurtosis : kurtosis() (fUtilities, moments or e1071)
  • all the moments : moment() (moments) and all.moments() (moments).
> library(moments)
>  x <- rnorm(1000)
> moment(x,order = 2) # second raw moment (close to the variance here because the mean is near 0)
[1] 0.999782
> all.moments(x, order.max = 4) # raw moments of order 0 to 4
[1] 1.000000000 0.006935727 0.999781992 0.062650605 2.972802009
> library("e1071")
> moment(x,order = 3) # third raw moment; use skewness(x) for the skewness
[1] 0.0626506
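
To make the link between moments, skewness and kurtosis explicit, they can be computed from their definitions in base R. The helper functions central_moment(), skew() and kurt() are written for this sketch and use the population (divide by n) convention, so they may differ slightly from estimators that apply a small-sample correction.

```r
# k-th central moment: average of (x - mean)^k
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness and kurtosis as standardized third and fourth central moments
skew <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurt <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

x <- c(1, 2, 3, 4, 5)
central_moment(x, 2)   # 2   (population variance, divides by n)
var(x)                 # 2.5 (sample variance, divides by n - 1)
skew(x)                # 0   for symmetric data
kurt(x)                # 1.7
```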


Order statistics

  • the range, the minimum and the maximum : range() returns the range of a vector (its minimum and maximum), min() the minimum and max() the maximum.
  • IQR() computes the interquartile range. median() computes the median and mad() the median absolute deviation.
  • quantile(), hdquantile() in the Hmisc package and kuantile() in the quantreg package compute the sample quantiles of a continuous vector. kuantile() may be more efficient when the sample size is large.
> library(Hmisc)
> library(quantreg)
> x <- rnorm(1000)
> p <- seq(0, 1, 0.25) # probabilities (named p to avoid masking the seq() function)
> quantile(x, probs = p, na.rm = FALSE, names = TRUE)
         0%         25%         50%         75%        100% 
-3.07328999 -0.66800917  0.02010969  0.72620061  2.92897970 
> hdquantile(x, probs = p, se = FALSE, na.rm = FALSE, names = TRUE, weights=FALSE)
       0.00        0.25        0.50        0.75        1.00 
-3.07328999 -0.66901899  0.02157989  0.72378407  2.92897970 
> kuantile(x, probs = seq(0, 1, .25), na.rm = FALSE, names = TRUE)
         0%         25%         50%         75%        100% 
-3.07328999 -0.66800917  0.02010969  0.72620061  2.92897970 
attr(,"class")
[1] "kuantile"
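
A tiny hand-checkable example shows why order statistics are often preferred to moments when the data contain outliers. The vector below is made up, with one extreme value.

```r
x <- c(1, 2, 3, 4, 100)
range(x)    # 1 100
median(x)   # 3
IQR(x)      # 2 (third quartile 4 minus first quartile 2)
mad(x)      # 1.4826 (median absolute deviation 1, times the constant 1.4826)
mean(x)     # 22, pulled up by the outlier
```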


Inequality Index

  • The Gini coefficient : Gini() (ineq) and gini() (reldist).
  • ineq() (ineq) computes any of these inequality indices, selected with its type argument.
> library(ineq)
> x <- rlnorm(1000)
> Gini(x)
[1] 0.5330694
> RS(x) #  Ricci-Schutz coefficient
[1] 0.3935813
> Atkinson(x, parameter = 0.5)
[1] 0.2336169
> Theil(x, parameter = 0)
[1] 0.537657
> Kolm(x, parameter = 1)
[1] 0.7216194
> var.coeff(x, square = FALSE)
[1] 1.446085
> entropy(x, parameter = 0.5)
[1] 0.4982675
> library("reldist")
> gini(x)
[1] 0.5330694
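
The Gini coefficient can also be computed by hand from the sorted data, which helps to see what the index measures. gini_coef() below is a hypothetical helper written for this sketch, without any finite-sample correction; it is expected to agree with the uncorrected value returned by Gini().

```r
# Gini coefficient from the ordered values x_(1) <= ... <= x_(n):
# G = sum((2i - n - 1) * x_(i)) / (n * sum(x))
gini_coef <- function(x) {
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

gini_coef(c(1, 1, 1, 1))   # 0: perfect equality
gini_coef(c(0, 0, 0, 1))   # 0.75: one unit holds everything
```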


  • Concentration index
> library(ineq)
> Herfindahl(x)
[1] 0.003091162
>  Rosenbluth(x)
[1] 0.002141646


  • Poverty index
> library(ineq)
> Sen(x,median(x)/2)
[1] 0.1342289
> ?pov # learn more about poverty index


Plotting the distribution

We can plot the distribution using a box plot (boxplot()), a histogram (hist()), a kernel density estimator (plot() with density()) or the empirical cumulative distribution function (plot() with ecdf()). See the Nonparametric section to learn more about histograms and kernel density estimators. qqnorm() produces a normal QQ plot and qqline() adds a line to the QQ plot which passes through the first and the third quartiles.

  • A box plot is a graphical representation of the minimum, the first quartile, the median, the third quartile and the maximum. By default, boxplot() extends the whiskers only to the most extreme points within 1.5 times the interquartile range of the box and draws more distant points individually.
  • stripchart() and stem() are also available.
> x <- rnorm(10^3)
> hist(x)
> plot(density(x))
> boxplot(x)
> plot(ecdf(x)) # plots the empirical distribution function
 
> qqnorm(x)
> qqline(x, col="red") # adds a line to the existing QQ plot (it does not create a new plot)


Goodness of fit tests

Kolmogorov-Smirnov test:

The KS test is a one-sample goodness of fit test. The test statistic is simply the maximum of the absolute value of the difference between the empirical cumulative distribution function and the theoretical cumulative distribution function. KSd() (sfsmisc) gives the critical values for the KS statistic. As an example, we draw a sample from a Beta(2,2) distribution and test whether it fits a Beta(2,2), a Beta(1,1) and a uniform distribution.

> y <- rbeta(1000,2,2) # Draw y in a Beta(2,2) distribution
> ks.test(y,"pbeta",2,2) # Test if it fits a beta(2,2) distribution
> ks.test(y,"pbeta",1,1) # Test if it fits a beta(1,1) distribution
> ks.test(y,"punif") # Test if it fits a uniform distribution (in fact the beta(1,1) is a uniform distribution)
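
The definition of the KS statistic can be checked by computing it by hand. Because the empirical distribution function jumps at each observation, the largest gap has to be looked for just before and just after each jump. This is a sketch on a simulated sample.

```r
set.seed(1)
y <- rbeta(1000, 2, 2)
n <- length(y)
Fy <- pbeta(sort(y), 2, 2)            # theoretical CDF at the ordered sample

D <- max(seq_len(n) / n - Fy,         # ECDF just after each jump minus the CDF
         Fy - (seq_len(n) - 1) / n)   # CDF minus the ECDF just before each jump
D
ks.test(y, "pbeta", 2, 2)$statistic   # same value
```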


Some tests are specific to the normal distribution. The Lilliefors test is an extension of the KS test for the case where the parameters are unknown. It is implemented by lillie.test() in the nortest package. shapiro.test() implements the Shapiro-Wilk normality test.

> N <- 100
> x <- rnorm(N)
> library("nortest")
> lillie.test(x)

         Lilliefors (Kolmogorov-Smirnov) normality test

data:  x 
D = 0.0955, p-value = 0.9982*
> shapiro.test(x)

	Shapiro-Wilk normality test

data:  x 
W = 0.9916, p-value = 0.7902
> library("nortest")
> ad.test(x)

	Anderson-Darling normality test

data:  x 
A = 0.2541, p-value = 0.7247

See also the package ADGofTest for another version of this test[1].

> sf.test(x)

	Shapiro-Francia normality test

data:  x 
W = 0.9866, p-value = 0.9953
> library("nortest")
> pearson.test(x)

	Pearson chi-square normality test

data:  x 
P = 0.8, p-value = 0.8495
  • Cramer-von Mises normality test
> cvm.test(x)

	Cramer-von Mises normality test

data:  x 
W = 0.0182, p-value = 0.9756
> library("tseries") # provides jarque.bera.test()
> jarque.bera.test(x)

	Jarque Bera Test

data:  x 
X-squared = 0.6245, df = 2, p-value = 0.7318

Discrete variable

We generate a discrete variable using sample() and we tabulate it using table(). We can plot using a pie chart (pie()), a bar chart (barplot() or barchart() (lattice)) or a dot chart (dotchart() or dotplot() (lattice)).

  • freq() (descr) prints the frequency, the percentages and produces a barplot. It supports weights.
> x <- sample(c("A","B","C"),100,replace=T)
> tab <- table(x)
> tab
> prop.table(tab)
> pie(tab)
> barplot(tab)
> dotchart(tab)
> library("descr")
> freq(x) 
x 
      Frequency Percent
A            32      32
B            34      34
C            34      34
Total       100     100
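
With a small fixed vector the results can be checked by hand. The following sketch uses only base R.

```r
x <- c("A", "B", "A", "C", "A", "B", "A", "C")
tab <- table(x)
tab                            # A: 4, B: 2, C: 2
prop.table(tab)                # relative frequencies: 0.50 0.25 0.25
round(100 * prop.table(tab))   # percentages
names(tab)[which.max(tab)]     # the mode: "A"
```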


Multivariate analysis

Continuous variables

  • Covariance : cov()
  • Pearson's linear correlation : cor().
  • Pearson's correlation test : cor.test().
  • Spearman's rank correlation :
    • cor() with method = "spearman".
    • spearman() (Hmisc)
  • Spearman's rank correlation test :
    • spearman2() (Hmisc)
    • spearman.test() (Hmisc)
    • spearman.test() (pspearman package) performs the Spearman’s rank correlation test with precomputed exact null distribution for n <= 22.
  • Kendall's correlation : cor() with method = "kendall". See also the Kendall package.
> N <- 100
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1 + 1
> y <- 1 + x1 + x2 + rnorm(N)
> plot(y ~ x1 ) # Scatter plot 
> mydat <- data.frame(y,x1,x2)
> cor(mydat)
> cor(mydat, method = "spearman")
> cor(mydat, method = "kendall")
> cor.test(mydat$x1,mydat$x2, method = "pearson")
> cor.test(mydat$x1,mydat$x2, method = "spearman")
> cor.test(mydat$x1,mydat$x2, method = "kendall")
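
Spearman's rank correlation is simply Pearson's correlation computed on the ranks of the data. The sketch below checks this on simulated data with a nonlinear but monotone relationship.

```r
set.seed(2)
x <- rnorm(50)
y <- exp(x) + rnorm(50, sd = 0.1)

cor(x, y)                        # Pearson: measures linear association
cor(x, y, method = "spearman")   # Spearman: measures monotone association
cor(rank(x), rank(y))            # Pearson on the ranks, same as Spearman
```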


Discrete variables

  • table(), xtabs() and prop.table() for contingency tables. ftable() (stats package) for a flat (nested) table.
  • assocplot() and mosaicplot() for graphical display of contingency tables.
  • CrossTable() (descr) is similar to SAS Proc Freq. It returns a contingency table with Chi square and Fisher independence tests.
  • my.table.NA() and my.table.margin() (cwhmisc)
  • chisq.detail() (TeachingDemos)
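
The base functions above can be illustrated on two small made-up factors. fisher.test() is used here for the independence test because the counts are too small for a chi-square approximation.

```r
sex   <- c("F", "M", "F", "M", "F", "M", "F", "M", "F", "F")
smoke <- c("no", "yes", "no", "no", "yes", "yes", "no", "no", "no", "yes")

tab <- table(sex, smoke)       # contingency table
tab
prop.table(tab, margin = 1)    # row proportions
xtabs(~ sex + smoke)           # same table with a formula interface
ftable(tab)                    # flat display
fisher.test(tab)$p.value       # exact test of independence
```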

Discrete and Continuous variables

  • bystats() Statistics by Categories in the Hmisc package
  • summaryBy() (doBy)
  • Multiple box plots : plot() or boxplot()
> N <- 100
> x <- sample(1:4,N, replace = T) 
> y <- x + rnorm(N)
> plot(y ~ x) # scatter plot
> plot(y ~ as.factor(x)) # multiple box plot
> boxplot(y ~ x) # multiple box plot
> library(Hmisc) # provides bystats()
> bystats(y , as.factor(x), fun = mean) 
> bystats(y , as.factor(x), fun = quantile)


  • Equality of two sample means : t.test() and wilcox.test(). Equality of variances : var.test(). Equality of two distributions : ks.test().
N <- 100
x <- sample(0:1,N, replace = T) 
y <- x + rnorm(N)
t.test(y ~ x )
wilcox.test(y ~ x)
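
The tests of equal variances and equal distributions are called in the same way on two vectors. In this sketch the two samples are simulated with different means and standard deviations.

```r
set.seed(3)
a <- rnorm(100)                       # mean 0, standard deviation 1
b <- rnorm(100, mean = 0.5, sd = 2)   # mean 0.5, standard deviation 2

t.test(a, b)     # Welch two-sample t-test (variances not assumed equal by default)
var.test(a, b)   # F test of equality of variances
ks.test(a, b)    # two-sample Kolmogorov-Smirnov test
```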


References

  1. Carlos J. Gil Bellosta (2009). ADGofTest: Anderson-Darling GoF test. R package version 0.1. http://CRAN.R-project.org/package=ADGofTest


Mathematics

Basics

?Arithmetic
?Special

Linear Algebra

Vectors

The inner product

The inner product is also called the dot product or the scalar product. It is the sum of the item-by-item product.

> u <- rep(3,3)
> v <- 1:3
> u%*%v # the inner product
     [,1]
[1,]   18
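
Since %*% returns a 1 x 1 matrix, it is sometimes more convenient to get the inner product as a plain number. A few equivalent ways, shown as a sketch:

```r
u <- rep(3, 3)
v <- 1:3
sum(u * v)        # 18: sum of the item-by-item product
drop(u %*% v)     # 18: drops the 1 x 1 matrix structure
crossprod(u, v)   # t(u) %*% v, returned as a 1 x 1 matrix
```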

The outer product

The outer product of two vectors u and v is the matrix whose element (i, j) is u[i] * v[j]. It should not be confused with the cross product (or vector product), which returns a vector.

> v <- rep(3,3)
> u <- 1:3
> u%o%v # The outer product
     [,1] [,2] [,3]
[1,]    3    3    3
[2,]    6    6    6
[3,]    9    9    9
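
The %o% operator is a shorthand for the outer() function, which also accepts functions other than the product. A short sketch:

```r
u <- 1:3
v <- rep(3, 3)
identical(outer(u, v), u %o% v)    # TRUE
outer(u, v)[2, 3] == u[2] * v[3]   # TRUE: element (i, j) is u[i] * v[j]
outer(1:3, 1:3, FUN = "+")         # table of sums instead of products
```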

Matrix Algebra

If you want to create a new matrix, one way is to use the matrix() function. You have to enter a vector of data, the number of rows and/or columns and finally you can specify if you want R to read your vector by row or by column (the default option) with byrow. You can also combine vectors using cbind() or rbind(). The dimension of a matrix can be obtained using the dim() function or alternatively nrow() and ncol().

> matrix(data = NA, nrow = 5, ncol = 5, byrow = T)
> matrix(data = 1:15, nrow = 5, ncol = 5, byrow = T)
> v1 <- 1:5
> v2 <- 5:1
> cbind(v1,v2)
> rbind(v1,v2)
> dim(X)
> nrow(X)
> ncol(X)
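
To make the effect of byrow concrete, here is a minimal sketch that fills the same data by column and by row. The object names by_col and by_row are only illustrative:

```r
# Fill a 2 x 3 matrix by column (the default) and by row
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3
dim(by_row)   # 2 3
```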

Some special matrices

The identity matrix has ones on the diagonal and zeros elsewhere. It can be created with:

  • eye() (matlab package)
  • diag(1,nrow=10,ncol=10)
  • diag(rep(1,10))

The J matrix is full of ones:

  • ones() (matlab package)

A matrix full of zeros:

  • zeros() (matlab package)
> library(matlab)
> eye(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
> ones(3)
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
[3,]    1    1    1
> zeros(3) 
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
[3,]    0    0    0

Diagonal matrix

> diag(3)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

Upper triangular

> n <- 3
> round(upper.tri(matrix(1, n, n)))
     [,1] [,2] [,3]
[1,]    0    1    1
[2,]    0    0    1
[3,]    0    0    0

If you also need ones on the diagonal:

> round(upper.tri(matrix(1, 3, 3), diag = TRUE))
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    0    1    1
[3,]    0    0    1

Lower triangular

Same as upper triangular, but using lower.tri() instead of upper.tri().


  • Create a Hilbert matrix using hilbert() (fUtilities).

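Matrix calculations

The %*% operator computes the matrix product and %x% the Kronecker product; kron() (fUtilities) is an alternative for the latter. The eye() function used below comes from the matlab package.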

> b <- matrix(nrow = 2, ncol = 2, c(1, 2, 3, 4))
> a <- matrix(nrow = 2, ncol = 2, c(1, 0, 0, -1))
> a
     [,1] [,2]
[1,]    1    0
[2,]    0   -1
> b
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> a%*%b
     [,1] [,2]
[1,]    1    3
[2,]   -2   -4
> b%*%a
     [,1] [,2]
[1,]    1   -3
[2,]    2   -4
> M <- matrix(rep(2,4),nrow = 2) 
> M
     [,1] [,2]
[1,]    2    2
[2,]    2    2
> I <- eye(2) 
> I
     [,1] [,2]
[1,]    1    0
[2,]    0    1
> I %x% M 
     [,1] [,2] [,3] [,4]
[1,]    2    2    0    0
[2,]    2    2    0    0
[3,]    0    0    2    2
[4,]    0    0    2    2
> library(fUtilities)
> kron(I,M)
     [,1] [,2] [,3] [,4]
[1,]    2    2    0    0
[2,]    2    2    0    0
[3,]    0    0    2    2
[4,]    0    0    2    2

Matrix transposition

  • Transpose a matrix using t().
> M <- cbind(c(1,0,1),c(0,1,2),c(0,0,1))
> t(M)
     [,1] [,2] [,3]
[1,]    1    0    1
[2,]    0    1    2
[3,]    0    0    1

The trace and determinant of a matrix

  • Compute the trace of a matrix using tr() (fUtilities) or, in base R, sum(diag(M)).
  • Compute the determinant of a matrix using det().
  • Compute the rank of a matrix using rk() (fBasics).
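
If the packages listed above are not installed, the same quantities can be obtained with base R only. The following is a minimal sketch; the example matrix M is just an illustration:

```r
M <- cbind(c(1, 0, 1), c(0, 1, 2), c(0, 0, 1))
trace_M <- sum(diag(M))   # trace: sum of the diagonal elements
det_M   <- det(M)         # determinant
rank_M  <- qr(M)$rank     # rank, from the QR decomposition
c(trace = trace_M, det = det_M, rank = rank_M)
```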

Matrix inversion

  • Invert a matrix using solve() or inv() (fUtilities). We can also compute the generalized inverse using ginv() in the MASS package.
> M <- cbind(c(1,0,1),c(0,1,2),c(0,0,1))
> solve(M)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]   -1   -2    1
> solve(M)%*%M
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

Solving a linear equation

solve(m, l) solves the linear system m %*% k = l for k.

> m=matrix(nrow=2,ncol=2,c(1,-.8,1,.2))
> m
     [,1] [,2]
[1,]  1.0  1.0
[2,] -0.8  0.2
> 
> l=matrix(c(1.0+25.0/18,25.0/18.0))
> l
         [,1]
[1,] 2.388889
[2,] 1.388889
> 
> k=solve(m,l)
> k
           [,1]
[1,] -0.9111111
[2,]  3.3000000
> 
> m%*%k          #checking the answer
         [,1]
[1,] 2.388889
[2,] 1.388889
>


Eigenvalue, eigenvector and eigenspace

  • Compute eigenvalues and eigenvectors using eigen(). Here M is the 3 x 3 matrix defined in the matrix inversion example.
> eigen(M)
$values
[1] 1 1 1

$vectors
     [,1]          [,2]          [,3]
[1,]    0  2.220446e-16  0.000000e+00
[2,]    0  0.000000e+00  1.110223e-16
[3,]    1 -1.000000e+00 -1.000000e+00
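
For a symmetric matrix the eigenvectors returned by eigen() are orthonormal, so the matrix can be rebuilt from its decomposition. The sketch below uses a small symmetric matrix A chosen only for illustration:

```r
A <- matrix(c(2, 1, 1, 2), nrow = 2)   # symmetric example matrix (an assumption)
e <- eigen(A)
e$values                                # 3 and 1
V <- e$vectors
# Rebuild A from its eigenvalues and eigenvectors: A = V diag(lambda) V'
A_rebuilt <- V %*% diag(e$values) %*% t(V)
all.equal(A, A_rebuilt)                 # TRUE
```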

Misc

  • Compute the norm of a matrix using norm() (fUtilities).
  • Check whether a matrix is positive definite using isPositiveDefinite() (fUtilities).
  • Make a matrix positive definite using makePositiveDefinite() (fUtilities).
  • Compute row statistics and column statistics (fUtilities).
  • Extract the upper and the lower part of a matrix using triang() and Triang() (fUtilities).
  • See also the Matrix, matlab, matrixcalc and matrixStats packages.

Analysis

Logarithm and Exponents

The power operator is ^ (or, equivalently, **). The natural logarithm is log(), the base 10 logarithm is log10(), and the exponential function is exp().

> 10^3 # exponent
[1] 1000
> 10**3 # exponent
[1] 1000
> exp(1) # exponential
[1] 2.718282
> log(2.71) # natural logarithm
[1] 0.9969486
> log10(1000) # base 10 logarithm
[1] 3
> log(1000,base = 10) # base 10 logarithm
[1] 3


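Polynomial equations

To solve a polynomial equation $a x^k + b x^{k-1} + \dots + n = 0$, where $a, b, \dots, n$ are given numbers, pass the coefficients to polyroot() in increasing order of degree (constant term first):

> polyroot(c(n,...,b,a))

So, for example, to calculate the roots of the equation $2x^2 - 5x - 3 = 0$ one would do as follows:

> polyroot(c(-3,-5,2))
 [1] -0.5+0i  3.0-0i

and the solutions can be read to be $x = -0.5$ and $x = 3$.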

See also polynom and multipol packages
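
One way to check the output of polyroot() is to evaluate the polynomial at the returned roots. This is a small sketch; the helper p() is only illustrative:

```r
coefs <- c(-3, -5, 2)                  # -3 - 5x + 2x^2, in increasing degree
roots <- polyroot(coefs)
# Evaluate the polynomial at each root; the values should be numerically zero
p <- function(x) coefs[1] + coefs[2] * x + coefs[3] * x^2
Mod(p(roots))
sort(Re(roots))                        # real parts: -0.5 and 3
```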

Derivatives

Symbolic calculations

R can compute the symbolic derivative of an expression with D(). You need to wrap your function in expression() first; otherwise you get an error message.

Here are some examples:

> D(expression(x^n),"x")
x^(n - 1) * n
> D(expression(exp(a*x)),"x")
exp(a * x) * a
> D(expression(1/x),"x")
-(1/x^2)
> D(expression(x^3),"x")
3 * x^2
> D(expression(pnorm(x)),"x")
dnorm(x)
> D(expression(dnorm(x)),"x")
-(x * dnorm(x))

Numerical approximation

  • The numDeriv package provides numerical derivatives, for example grad() and hessian().
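
Without any extra package, a derivative can be approximated by a central finite difference. The sketch below is only an illustration: num_deriv() and its step size h are hypothetical names and choices, not functions from numDeriv.

```r
# Central finite difference: f'(x) ~ (f(x + h) - f(x - h)) / (2h)
num_deriv <- function(f, x, h = 1e-5) {
  (f(x + h) - f(x - h)) / (2 * h)
}
num_deriv(pnorm, 0)                 # close to dnorm(0)
num_deriv(function(x) x^3, 2)       # close to 3 * 2^2 = 12
```

The step size is a compromise: a smaller h reduces the truncation error but amplifies rounding error.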

Integration

R can perform one-dimensional integration with integrate(). For example, we can integrate the density of the normal distribution between $-\infty$ and $+\infty$, or between two finite bounds:

> integrate(dnorm,-Inf,Inf)
1 with absolute error < 9.4e-05
> integrate(dnorm,-1.96,1.96)
0.9500042 with absolute error < 1.0e-11
> integrate(dnorm,-1.64,1.64)
0.8989948 with absolute error < 6.8e-14
# we can also store the result in an object
> ci90 <- integrate(dnorm,-1.64,1.64)
> ci90$value
[1] 0.8989948
> integrate(dnorm,-1.64,1.64)$value
[1] 0.8989948

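See the adapt package for multivariate integration. Note that adapt has since been archived on CRAN; the cubature package provides adaptIntegrate() for the same purpose.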

> library(adapt)
> ?adapt
> ir2pi <- 1/sqrt(2*pi)
> fred <- function(z) { ir2pi^length(z) * exp(-0.5 * sum(z * z))}
> 
> adapt(2, lo = c(-5,-5), up = c(5,5), functn = fred)
       value       relerr       minpts       lenwrk        ifail 
    1.039222 0.0007911264          231           73            0 
> adapt(2, lo = c(-5,-5), up = c(5,5), functn = fred, eps = 1e-4)
       value       relerr       minpts       lenwrk        ifail 
    1.000237 1.653498e-05          655          143            0 
> adapt(2, lo = c(-5,-5), up = c(5,5), functn = fred, eps = 1e-6)
      value      relerr      minpts      lenwrk       ifail 
   1.000039 3.22439e-07        1719         283           0
  • See also integrate.gh() in the ecoreg package.
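
For low-dimensional problems, a double integral can also be approximated by nesting two calls to integrate(). The following is a sketch under that assumption; f2() and inner() are illustrative names, and the approach becomes slow in higher dimensions:

```r
# Bivariate standard normal density with independent components
f2 <- function(x, y) dnorm(x) * dnorm(y)
# Inner integral over x for each fixed y; sapply() keeps it vectorised in y
inner <- function(y) {
  sapply(y, function(yy) integrate(function(x) f2(x, yy), -5, 5)$value)
}
total <- integrate(inner, -5, 5)$value
total    # close to 1
```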

Probability

  • The number of combinations of k elements chosen among n:
> choose(100, 5)
[1] 75287520
  • Union and intersection
> union(1:10, 5:7)
[1]  1  2  3  4  5  6  7  8  9 10
> intersect(1:10, 5:7)
[1] 5 6 7

Arithmetics

The factorial function

factorial() returns the factorial of a non-negative integer. It can also be computed by applying prod() (product) to the vector of integers between 1 and the number of interest.

> factorial(3)
[1] 6
> prod(1:3)
[1] 6

Note that by convention $0! = 1$, and factorial(0) returns 1 accordingly. prod(0) does not reproduce this, because it is simply the product of the single number 0.

> factorial(0)
[1] 1
> prod(0)
[1] 0

Factorials grow very quickly: beyond 170 the result exceeds the largest representable number and Inf is returned.

> factorial(170)
[1] 7.257416e+306
> factorial(171)
[1] Inf
Warning message:
In factorial(171) : value out of range in 'gammafn'
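
When large factorials are only needed in ratios or products, a common workaround is to work on the log scale with lfactorial(), which is part of base R. A minimal sketch:

```r
lfactorial(171)              # log(171!), finite even though factorial(171) overflows
lfactorial(171) / log(10)    # about 309.1, so 171! is of the order of 1e309
# Ratios of large factorials can be formed on the log scale
exp(lfactorial(171) - lfactorial(170))   # 171
```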

The modulo function and Euclidean division

  • Modulo (%%) and integer division (%/%, i.e. Euclidean division)
> 5%%2
[1] 1
> 5%/%2
[1] 2

Note: with non-integer numbers, these operators are affected by floating-point rounding, because values such as 0.1 have no exact binary representation.

> .5%/%.1 # we get 4 instead of 5
[1] 4
> .5%%.1 # we get .1 instead of 0
[1] 0.1
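
One possible workaround, when the inputs are known to be multiples of a common unit, is to rescale them to whole numbers before dividing. The sketch below assumes the data are expressed in tenths; the variable names are illustrative:

```r
x <- 0.5
y <- 0.1
scale <- 10                 # work in tenths (an assumption about the data)
xi <- round(x * scale)      # 5
yi <- round(y * scale)      # 1
xi %/% yi                   # 5
xi %% yi                    # 0
```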

Geometry

  • pi the constant
  • cos(), sin(), tan() the trigonometric functions.

Symbolic calculus

The rSymPy package provides SymPy functions in R.

If you want to do more symbolic calculus, see Maxima[1], SAGE[2], Mathematica[3].

See also

The following command gives help on special mathematical functions related to the beta and gamma functions.

?Special

References

  1. Maxima is open source http://maxima.sourceforge.net/
  2. SAGE is an open source package which includes R and Maxima : http://www.sagemath.org/
  3. Mathematica is not open source http://www.wolfram.com/products/mathematica/index.html


Optimization


Numerical Methods

One dimensional problem

For a one-dimensional problem, plot the function to locate the optimum visually, then use optimize():

> func <- function(x){
+ 	return ( (x-2)^2 )
+ 	}
> (func(-2))
[1] 16
>
> # plot your function using the 'curve function'
> curve(func,-4,8) 
>
> # Here is another way to plot the function
> # using a grid
> grid <- seq(-10,10,by=.1) 
> func(grid)
> plot(grid,func(grid))
> 
> # you can find the minimum using the optimize function
> optimize(f=func,interval=c(-10,10))
$minimum
[1] 2

$objective
[1] 0

Newton-Raphson

  • nlm() provides a Newton-type algorithm.
  • The maxLik package maximizes likelihood functions. It includes the Newton-Raphson method.
  • newtonraphson() in the spuRs package.
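
To show the idea behind the method, here is a hand-written sketch of Newton-Raphson for a one-dimensional minimum, compared with nlm(). The function newton_min(), its arguments, and the test function are illustrative assumptions, not code from any of the packages above.

```r
# Newton-Raphson for a one-dimensional minimum: iterate x <- x - f'(x) / f''(x)
newton_min <- function(grad, hess, x0, tol = 1e-8, max_iter = 100) {
  x <- x0
  for (i in seq_len(max_iter)) {
    step <- grad(x) / hess(x)
    x <- x - step
    if (abs(step) < tol) break
  }
  x
}
# f(x) = (x - 2)^4 + x^2, so f'(x) = 4 (x - 2)^3 + 2 x and f''(x) = 12 (x - 2)^2 + 2
f    <- function(x) (x - 2)^4 + x^2
grad <- function(x) 4 * (x - 2)^3 + 2 * x
hess <- function(x) 12 * (x - 2)^2 + 2
x_star <- newton_min(grad, hess, x0 = 0)
x_star
grad(x_star)                       # numerically zero at the minimum
x_nlm <- nlm(f, p = 0)$estimate    # nlm() should find a similar value
```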

BFGS

optim() implements the BFGS quasi-Newton method and its bounded, limited-memory variant L-BFGS-B.

> func <- function(x){
+   out <- (x[1]-2)^2 + (x[2]-1)^2
+   return(out)
+   }
> optim(par=c(0,0), fn=func, gr = NULL,
+       method = "BFGS",
+       lower = -Inf, upper = Inf,
+       control = list(), hessian = TRUE)
> optim(par=c(0,0), fn=func, gr = NULL,
+       method = "L-BFGS-B",
+       lower = -Inf, upper = Inf,
+       control = list(), hessian = TRUE)
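
When gr = NULL, optim() approximates the gradient by finite differences. If the gradient is known analytically it can be supplied through gr, as in the following sketch (the name grad is illustrative):

```r
func <- function(x) (x[1] - 2)^2 + (x[2] - 1)^2
# Analytic gradient of func
grad <- function(x) c(2 * (x[1] - 2), 2 * (x[2] - 1))
fit <- optim(par = c(0, 0), fn = func, gr = grad, method = "BFGS")
fit$par           # close to c(2, 1)
fit$value         # close to 0
fit$convergence   # 0 indicates successful convergence
```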

Conjugate gradient method

  • optim() with method = "CG".

Trust Region Method

  • The trust package implements a trust region method.


The Nelder-Mead simplex method

> func <- function(x){
+   out <- (x[1]-2)^2 + (x[2]-1)^2
+   return(out)
+   }
> 
> optim(par=c(0,0), fn=func, gr = NULL,
+       method = "Nelder-Mead",
+       lower = -Inf, upper = Inf,
+       control = list(), hessian = TRUE)


  • The boot package includes simplex(), which implements the simplex method for linear programming, a different algorithm from Nelder-Mead.

Simulation methods

Simulated Annealing

  • Simulated annealing is an algorithm which is useful for optimizing non-smooth functions. It is implemented in optim() as method "SANN".
> func <- function(x){
+   out <- (x[1]-2)^2 + (x[2]-1)^2
+   return(out)
+   }
> optim(par=c(0,0), fn=func, gr = NULL,
+       method = "SANN",
+       lower = -Inf, upper = Inf,
+       control = list(), hessian = TRUE)

EM Algorithm

Genetic Algorithm

  • rgenoud package for genetic algorithm[3]
  • gaoptim package for genetic algorithm[4]
  • GA is a general-purpose package for optimization using genetic algorithms. It provides a flexible set of tools for implementing genetic algorithm search in both the continuous and discrete case, whether constrained or not.[5]



Probability Distributions

This page reviews the main probability distributions and describes the main R functions to deal with them.

R provides functions for many probability distributions. They share a common naming scheme:

  • r is the generic prefix for random number generators, such as runif(), rnorm().
  • d is the generic prefix for the probability density (or mass) function, such as dunif(), dnorm().
  • p is the generic prefix for the cumulative distribution function, such as punif(), pnorm().
  • q is the generic prefix for the quantile function, such as qunif(), qnorm().
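
The four families of functions are linked to each other. The sketch below illustrates these links for the normal distribution; the seed and sample size are arbitrary choices:

```r
p <- c(0.025, 0.5, 0.975)
q <- qnorm(p)                        # quantiles for the probabilities in p
pnorm(q)                             # recovers p: pnorm() and qnorm() are inverses
# The CDF is the integral of the density
integrate(dnorm, -Inf, 1.96)$value   # close to pnorm(1.96)
# Random draws: the sample share below a quantile approximates its probability
set.seed(123)                        # arbitrary seed, for reproducibility only
mean(rnorm(1e5) < qnorm(0.975))      # roughly 0.975
```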

Discrete distributions

Benford Distribution

The Benford distribution is the distribution of the first digit of a number. It is due to Benford (1938)[1] and Newcomb (1881)[2].

> library(VGAM)
> dbenf(c(1:9))
[1] 0.30103000 0.17609126 0.12493874 0.09691001 0.07918125 0.06694679 0.05799195 0.05115252 0.04575749
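
The probabilities above follow from the formula P(d) = log10(1 + 1/d) for the first digit d. As a sketch, they can be reproduced in base R without the VGAM package:

```r
d <- 1:9
benford <- log10(1 + 1 / d)   # P(first digit = d)
round(benford, 4)
sum(benford)                  # the nine probabilities sum to 1
```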

Bernoulli

We can draw from a Bernoulli distribution using sample(), runif() or rbinom() with size = 1.

> n <- 1000
> x <- sample(c(0,1), n, replace=TRUE) # P(x = 1) = 0.5
> x <- sample(c(0,1), n, replace=TRUE, prob=c(0.3,0.7)) # P(x = 1) = 0.7
> x <- runif(n) > 0.3 # logical vector, TRUE with probability 0.7
> x <- rbinom(n, size=1, prob=0.2) # P(x = 1) = 0.2

Binomial

We can sample from a binomial distribution using the rbinom() function with arguments n for number of samples to take, size defining the number of trials and prob defining the probability of success in each trial.

> x <- rbinom(n=100,size=10,prob=0.5)

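Hypergeometric distribution

We can sample nn times from a hypergeometric distribution using the rhyper() function. Its arguments are the number of observations nn, the number of white balls m, the number of black balls n, and the number of balls drawn from the urn k. Note that n is not the sample size here, so the arguments are best given by name.

> x <- rhyper(nn=1000, m=15, n=5, k=5)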

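Geometric distribution

rgeom() draws from the geometric distribution, i.e. the number of failures before the first success in a series of Bernoulli trials with success probability prob.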

> N <- 10000
> x <- rgeom(N, .5)
> x <- rgeom(N, .01)

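Multinomial

The multinomial distribution generalizes the binomial distribution to more than two outcomes. Individual draws can be simulated with sample(); rmultinom() returns the counts per category instead.

> sample(1:6, 100, replace=TRUE, prob= rep(1/6,6))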

Negative binomial distribution

The negative binomial distribution is the distribution of the number of failures before k successes in a series of Bernoulli events.

> N <- 100000
> x <- rnbinom(N, 10, .25)

Poisson distribution

We can draw n values from a Poisson distribution with a mean set by the argument lambda.

> x <- rpois(n=100, lambda=3)

Zipf's law

The distribution of the frequency of words is known as Zipf's law. It is also a good description of the distribution of city sizes[3]. The VGAM package provides dzipf() and pzipf().

> library(VGAM)
> dzipf(x=2, N=1000, s=2)

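Continuous distributions

Beta and Dirichlet distributions

Base R provides rbeta(), dbeta(), pbeta() and qbeta() for the beta distribution. Functions for the Dirichlet distribution are available in the gtools, bayesm and MCMCpack packages.

> library(gtools)
> ?rdirichlet
> library(bayesm)
> ?rdirichlet
> library(MCMCpack)
> ?Dirichlet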

Cauchy

We can sample n values from a Cauchy distribution with a given location parameter (default is 0) and scale parameter (default is 1) using the rcauchy() function.

> x <- rcauchy(n=100, location=0, scale=1)

Chi Square distribution

Quantiles of the chi-square distribution ($\chi^2$ distribution):

> qchisq(.95,1)
[1] 3.841459
> qchisq(.95,10)
[1] 18.30704
> qchisq(.95,100)
[1] 124.3421

Exponential

We can sample n values from an exponential distribution with a given rate (default is 1) using the rexp() function.

> x <- rexp(n=100, rate=1)

Fisher-Snedecor

We can draw the density of a Fisher distribution (F-distribution) for several pairs of degrees of freedom:

> par(mar=c(3,3,1,1))
> x <- seq(0,5,len=1000)
> plot(range(x),c(0,2),type="n")
> grid()
> lines(x,df(x,df1=1,df2=1),col="black",lwd=3)
> lines(x,df(x,df1=2,df2=1),col="blue",lwd=3)
> lines(x,df(x,df1=5,df2=2),col="green",lwd=3)
> lines(x,df(x,df1=100,df2=1),col="red",lwd=3)
> lines(x,df(x,df1=100,df2=100),col="grey",lwd=3)
> legend(2,1.5,legend=c("n1=1, n2=1","n1=2, n2=1","n1=5, n2=2","n1=100, n2=1","n1=100, n2=100"),col=c("black","blue","green","red","grey"),lwd=3,bty="n")

Gamma

We can sample n values from a gamma distribution with a given shape parameter and scale parameter using the rgamma() function. Alternatively, a shape parameter and a rate parameter (rate = 1/scale) can be given. The shape parameter is always required, and conflicting values of scale and rate cannot be supplied together.

> x <- rgamma(n=10, scale=1, shape=0.4)
> x <- rgamma(n=100, rate=0.8, shape=0.4)
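
The two parameterisations describe the same distribution, with scale = 1/rate. The sketch below checks this by drawing twice with the same seed; the seed and parameter values are arbitrary:

```r
set.seed(42)                                   # arbitrary seed
x_rate  <- rgamma(5, shape = 2, rate = 0.5)
set.seed(42)
x_scale <- rgamma(5, shape = 2, scale = 2)     # scale = 1 / rate
all.equal(x_rate, x_scale)                     # TRUE: same parameterisation
# The mean of a gamma distribution is shape * scale (equivalently shape / rate)
mean(rgamma(1e5, shape = 2, rate = 0.5))       # roughly 4
```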

Levy

We can sample n values from a Levy distribution with a given location parameter (defined by the argument m, default is 0) and scaling parameter (given by the argument s, default is 1) using the rlevy() function, which is available for example in the rmutil package.

> x <- rlevy(n=100, m=0, s=1)

Log-normal distribution

We can sample n values from a log-normal distribution with a given meanlog (default is 0) and sdlog (default is 1) using the rlnorm() function.

> x <- rlnorm(n=100, meanlog=0, sdlog=1)

Normal and related distributions

We can sample n values from a normal (or Gaussian) distribution with a given mean (default is 0) and sd (default is 1) using the rnorm() function.

> x <- rnorm(n=100, mean=0, sd=1)

Quantiles of the normal distribution:

> qnorm(.95)
[1] 1.644854
> qnorm(.975)
[1] 1.959964
> qnorm(.99)
[1] 2.326348
  • The mvtnorm package includes functions for multivariate normal distributions.
    • rmvnorm() draws from a multivariate normal distribution.
> library(mvtnorm)
> sig <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
> r <- rmvnorm(1000, sigma = sig)
> cor(r) 
          [,1]      [,2]
[1,] 1.0000000 0.8172368
[2,] 0.8172368 1.0000000

Pareto Distributions

  • Generalized Pareto dgpd() in evd
  • dpareto(), ppareto(), rpareto(), qpareto() in actuar
  • The VGAM package also has functions for the Pareto distribution.

Student's t distribution

Quantiles of the Student t distribution:

> qt(.975,30)
[1] 2.042272
> qt(.975,100)
[1] 1.983972
> qt(.975,1000)
[1] 1.962339

The following lines plot the 0.975 quantile of the t distribution as a function of the degrees of freedom. The horizontal line shows the corresponding quantile of the normal distribution:

curve(qt(.975,x), from = 2 , to = 100, ylab = "Quantile 0.975 ", xlab = "Degrees of freedom", main = "Student t distribution")
abline(h=qnorm(.975), col = 2)

Uniform distribution

We can sample n values from a uniform distribution (also known as a rectangular distribution) between two values (defaults are 0 and 1) using the runif() function.

> runif(n=100, min=0, max=1)

Weibull

We can sample n values from a Weibull distribution with a given shape and scale parameter (default is 1) using the rweibull() function.

> x <- rweibull(n=100, shape=0.5, scale=1)

Extreme values and related distributions

  • Logistic: plogis(), qlogis(), dlogis(), rlogis()
  • Frechet: dfrechet() (evd)
  • Generalized Extreme Value: dgev() (evd)
  • Gumbel: dgumbel() (evd)
  • Burr: dburr(), pburr(), qburr(), rburr() (actuar)

Distributions in circular statistics

  • Functions for circular statistics are included in the CircStats package.
    • dvm() Von Mises (also known as the circular normal or Tikhonov distribution) density function
    • dtri() triangular density function
    • dmixedvm() Mixed Von Mises density
    • dwrpcauchy() wrapped Cauchy density
    • dwrpnorm() wrapped normal density.

See also

  • Packages VGAM, SuppDists, actuar, fBasics, bayesm, MCMCpack

References

  1. Benford, F. (1938) The Law of Anomalous Numbers. Proceedings of the American Philosophical Society, 78, 551–572.
  2. Newcomb, S. (1881) Note on the Frequency of Use of the Different Digits in Natural Numbers. American Journal of Mathematics, 4, 39–40.
  3. Gabaix, Xavier (August 1999). "Zipf's Law for Cities: An Explanation". Quarterly Journal of Economics 114 (3): 739–67. doi:10.1162/003355399556133. ISSN 0033-5533. http://pages.stern.nyu.edu/~xgabaix/papers/zipf.pdf.


Random Number Generation

Random Number Generators

Computers are, to a very high degree, deterministic and are therefore not a reliable source of large amounts of truly random values. In general, pseudo-random number generators are used instead. The default algorithm in R is Mersenne-Twister, but a long list of methods is available. See the help of RNGkind() to learn about random number generators.

?RNGkind

It is also possible to use true random numbers. Some are collected on random.org, and the random package gives access to them.

Randu

Randu is an old linear congruential pseudorandom number generator. A dataset generated with Randu is included in the datasets package. The code used to generate the dataset is given in its help page.

library("datasets")
?randu

Seed

A pseudo random number generator is an algorithm based on a starting point called "seed". If you want to perform an exact replication of your program, you have to specify the seed using the function set.seed(). The argument of set.seed has to be an integer.

> set.seed(1)
> runif(1)
[1] 0.2655087
> set.seed(1)
> runif(1)
[1] 0.2655087

Sampling in a vector

Toss 10 coins

> sample(0:1,10,replace=T)
 [1] 1 0 0 0 1 0 0 1 1 1

Roll 10 dice

> sample(1:6,10,replace=T)
 [1] 4 1 5 3 2 5 5 6 3 2

Play the lottery (6 random numbers out of 49 without replacement)

> sample(1:49,6,replace=F)
[1] 18 35 29  1 33 11


You can sample from a multinomial distribution:

> mydat <- sample(1:4,1000,rep=TRUE,prob=c(.2,.3,.2,.3))
> table(mydat)

Sampling in a standard univariate distribution

You can use rnorm(), rt(), etc.

Misspecified argument

Note that if you pass a vector instead of a number as the first argument of rnorm(), R uses the length of the vector as the sample size instead of returning an error. Here is an example:

x <- rnorm(10) # Sample a normal random vector
set.seed(1) # use the seed
z <- rnorm(x) # put a vector instead of a number as an argument of rnorm
set.seed(1) # initialize the seed again
z2 <- rnorm(length(x)) # sample in a vector with the same length as x
plot(z2,z) # check that z and z2 are the same

Inverse CDF method

  • If you know the inverse CDF (quantile function), you can generate the random variable by sampling from the standard uniform distribution and transforming the draws with the inverse CDF.

For instance, if you want to simulate from a standard normal distribution, you can simulate from a standard uniform and transform it using the quantile function of the normal distribution.

N <- 100
qnorm(runif(N))

This produces draws from the same distribution as the rnorm() function, but the computing time is slightly higher:

> N <- 10^7
> system.time(qnorm(runif(N)))
   user  system elapsed 
   1.67    0.00    1.70 
> system.time(rnorm(N)) 
   user  system elapsed 
   1.50    0.00    1.51 
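
The same method works for any distribution whose CDF can be inverted in closed form. As a sketch, the exponential distribution has CDF F(x) = 1 - exp(-rate * x), so its inverse is -log(1 - u) / rate. The seed and the rate below are arbitrary:

```r
set.seed(1)                           # arbitrary seed
N <- 1e5
rate <- 2                             # illustrative rate parameter
u <- runif(N)
x <- -log(1 - u) / rate               # inverse of F(x) = 1 - exp(-rate * x)
mean(x)                               # close to 1 / rate = 0.5
all.equal(x, qexp(u, rate = rate))    # TRUE: qexp() is the same transformation
```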

Importance sampling

Metropolis algorithm

Gibbs algorithm

Quasi random numbers

Sometimes you need to generate quasi-random sequences. The randtoolbox package provides several quasi-random number generators.

See also sHalton() and QUnif() (sfsmisc).

> library(randtoolbox)
> halton(10, dim = 2, init = TRUE, normal = FALSE, usetime = FALSE)
        [,1]       [,2]
 [1,] 0.5000 0.33333333
 [2,] 0.2500 0.66666667
 [3,] 0.7500 0.11111111
 [4,] 0.1250 0.44444444
 [5,] 0.6250 0.77777778
 [6,] 0.3750 0.22222222
 [7,] 0.8750 0.55555556
 [8,] 0.0625 0.88888889
 [9,] 0.5625 0.03703704
[10,] 0.3125 0.37037037

You can compare Halton draws with the standard R (pseudo) random number generator. Halton draws cover the unit square much more evenly.

> random <- cbind(runif(1000),runif(1000))
> hal <- halton(1000, dim = 2, init = TRUE, normal = FALSE, usetime = FALSE)
> par(mfrow=c(1,2))
> plot(hal[,1],hal[,2])
> plot(random[,1],random[,2])
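
To see where these numbers come from: each coordinate of a Halton sequence is a radical-inverse (van der Corput) sequence in a prime base, base 2 for the first coordinate and base 3 for the second. The following base R sketch reproduces the first draws shown above; radical_inverse() is an illustrative helper, not a randtoolbox function.

```r
# Radical inverse of the integers 1..n in a given base (van der Corput sequence)
radical_inverse <- function(n, base) {
  sapply(seq_len(n), function(i) {
    x <- 0
    f <- 1 / base
    while (i > 0) {
      x <- x + f * (i %% base)   # next digit of i, mirrored around the radix point
      i <- i %/% base
      f <- f / base
    }
    x
  })
}
halton2d <- cbind(radical_inverse(10, 2), radical_inverse(10, 3))
halton2d[1:4, ]
```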



Maximum Likelihood

Introduction

Maximum likelihood estimation is an optimization problem: you write down your log-likelihood function and maximize it with some optimization technique. Sometimes you also need to write the score (the first derivative of the log-likelihood) and/or the Hessian (the second derivative of the log-likelihood).

One dimension

If there is only one parameter, we can optimize the log-likelihood using optimize().

Example with a type 1 Pareto distribution

We provide an example with a type 1 Pareto distribution. Note that in this example we treat the minimum as known and do not estimate it. Therefore this is a one-dimensional problem.

We use the rpareto1() (actuar) function to generate a random vector from a type 1 Pareto distribution with shape equal to 1 and minimum value equal to 500. We use the dpareto1() (actuar) function with option log = TRUE to write the log-likelihood. Then we just need to use optimize() with maximum = TRUE. We provide a minimum and a maximum value for the parameter with the interval option.

> library(actuar)
> y <- rpareto1(1000, shape = 1, min = 500)
> ll <- function(mu, x) { 
+    sum(dpareto1(x,mu[1],min = min(x),log = TRUE)) 
+   } 
> optimize(f = ll, x = y, interval = c(0,10), maximum = TRUE)
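
For the type 1 Pareto distribution with a known minimum, the maximum likelihood estimator of the shape also has a closed form, n / sum(log(x / minimum)), which gives a convenient check on the numerical result. The sketch below uses base R only: the draws are generated by the inverse CDF method instead of rpareto1(), and the names x_min and shape_true are illustrative.

```r
set.seed(10)                                # arbitrary seed
n <- 1000
x_min <- 500
shape_true <- 1
# Inverse CDF draw from a type 1 Pareto: x = x_min / U^(1 / shape)
y <- x_min / runif(n)^(1 / shape_true)
# Log-likelihood in the shape parameter a, with the minimum treated as known
ll <- function(a, x, x_min) sum(log(a) + a * log(x_min) - (a + 1) * log(x))
fit <- optimize(f = ll, x = y, x_min = x_min, interval = c(0.01, 10), maximum = TRUE)
fit$maximum                                 # numerical MLE of the shape
shape_closed <- n / sum(log(y / x_min))     # closed-form MLE, for comparison
shape_closed
```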

Multiple dimension

  • fitdistr() (MASS package) fits univariate distributions by maximum likelihood. It is a wrapper for optim().
  • If you need to program your maximum likelihood estimator (MLE) yourself, you have to use a built-in optimizer such as nlm() or optim(). The following tools are also available:
  • mle() in the stats4 package
  • The maxLik package
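
As an illustration of mle() from the stats4 package, the sketch below fits the mean and standard deviation of a normal sample. The simulated data, the log parameterisation of the standard deviation, and the starting values are all assumptions made for this example.

```r
library(stats4)
set.seed(2)                                  # arbitrary seed
x <- rnorm(500, mean = 3, sd = 2)
# mle() expects the negative log-likelihood as a function of named parameters;
# the standard deviation is estimated on the log scale to keep it positive
nll <- function(mu, log_sigma) {
  -sum(dnorm(x, mean = mu, sd = exp(log_sigma), log = TRUE))
}
fit <- mle(nll, start = list(mu = median(x), log_sigma = log(mad(x))))
coef(fit)
exp(coef(fit)["log_sigma"])                  # estimate of the standard deviation
```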


Example with a logistic distribution

For instance, we draw from a logistic distribution and we estimate the parameters using fitdistr().

> library(MASS)
> # draw from a Gumbel distribution using the inverse CDF simulation method
> e.1 <- -log(-log(runif(10000,0,1))) 
> e.2 <- -log(-log(runif(10000,0,1)))
> u <- e.2 - e.1  # u follows a logistic distribution (difference between two Gumbels)
> fitdistr(u,densfun=dlogis,start=list(location=0,scale=1))

Example with a Cauchy distribution

For instance, we can write a simple maximum likelihood estimator for a Cauchy distribution using the nlm() optimizer. We first draw a vector x from a Cauchy distribution. Then we define the negative log-likelihood function and minimize it using the nlm() function. Note that nlm() is a minimizer and not a maximizer.

> n <- 100
> x <- rcauchy(n)
> mlog.1 <- function(mu, x) { 
+   - sum(dcauchy(x, location = mu, log = TRUE)) 
+   } 
> mu.start <- median(x)
> out <- nlm(mlog.1, mu.start, x = x)


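Example with a beta distribution

Here is another example with the beta distribution and the optim() function. Since optim() minimizes by default, the function below returns the negative log-likelihood. The lower bounds are set slightly above zero because both shape parameters must be strictly positive.

> y <- rbeta(1000,2,2)
> mloglik <- function(mu, x) { 
+    sum(-dbeta(x,mu[1],mu[2],log = TRUE)) 
+    } 
> 
> out <- optim(par = c(1,1), fn=mloglik,x=y,method = "L-BFGS-B",lower=c(0.001,0.001))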

Tests

Likelihood Ratio Test

  • lrtest() in the lmtest package[1].
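
The likelihood ratio statistic can also be computed by hand from two nested fitted models, using logLik() and pchisq() from base R. The sketch below uses simulated data and two linear models chosen only for illustration:

```r
set.seed(3)                                  # arbitrary seed
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)                   # x2 has no effect in this simulated data
restricted <- lm(y ~ x1)
full       <- lm(y ~ x1 + x2)
# LR statistic: twice the gain in log-likelihood, chi-square with 1 df under H0
lr_stat <- as.numeric(2 * (logLik(full) - logLik(restricted)))
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
c(statistic = lr_stat, p.value = p_value)
```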


Some Specific cases

  • gum.fit() (ismev package) provides MLE for a Gumbel distribution.


References

  1. Achim Zeileis, Torsten Hothorn (2002). Diagnostic Checking in Regression Relationships. R News 2(3), 7-10. URL http://CRAN.R-project.org/doc/Rnews/


Method of Moments

  • The gmm package implements the generalized method of moments (GMM) and generalized empirical likelihood.

First, it is possible to estimate a simple linear model or a simple linear model with instrumental variables using the gmm() function. The GMM method is often used to estimate heteroskedastic instrumental variable models.

> # Simple linear model
> library(gmm)
> N <- 1000
> u <- rnorm(N)
> x <- 1 + rnorm(N)
> y <- 1 + x + u
> res <- gmm(y ~ x, x)

> # Simple linear model with instrumental variables.
> N <- 1000
> u <- rnorm(N)
> z <- rnorm(N)
> x <- 1 + z + u + rnorm(N)
> y <- 1 + x + u
> res <- gmm(y ~ x, z)
> summary(res)
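
With one instrument and one regressor, the moment condition E[z (y - a - b x)] = 0 can be solved by hand, which gives the slope b = cov(z, y) / cov(z, x). The base R sketch below uses a simulation of the same form as above (with a larger, arbitrary sample size) and contrasts this estimator with ordinary least squares:

```r
set.seed(4)                          # arbitrary seed
N <- 10000
u <- rnorm(N)
z <- rnorm(N)                        # instrument: correlated with x, not with u
x <- 1 + z + u + rnorm(N)            # x is endogenous because it depends on u
y <- 1 + x + u
# Moment condition E[z * (y - a - b * x)] = 0 gives b = cov(z, y) / cov(z, x)
b_iv  <- cov(z, y) / cov(z, x)
b_ols <- cov(x, y) / var(x)          # OLS slope, biased upwards here
c(iv = b_iv, ols = b_ols)
```

The true slope in this simulation is 1, so the instrumental variable estimate should be close to 1 while the OLS estimate is noticeably larger.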


Bayesian Methods

Introduction

R has many Bayesian estimation procedures, many more than Stata or SAS.

  • LearnBayes by Jim Albert
  • bayesm by Peter Rossi and Rob McCulloch
  • BaM by Jeff Gill
  • arm package by Jennifer Hill and Andrew Gelman.
  • MCMCpack package.
  • mcsm package by Christian Robert and George Casella.
  • umacs by Jouni Kerman and Andrew Gelman.

Interface with WinBugs

  • WinBugs/OpenBugs is a popular statistical package for MCMC techniques.
  • Andrew Gelman has some instructions for using R and WinBugs on his webpage.
  • There is also an interface with JAGS.


Bootstrap


  • boot package includes functions from the book Bootstrap Methods and Their Applications by A. C. Davison and D. V. Hinkley (1997, CUP)
  • bootstrap package.

Quick how-to

Do a bootstrap of some data for some statistic (here, the mean). The statistic is passed as a function of the data and of a vector of resampled indices:

 library(boot)
 b <- boot(data, function(data, id) { mean(data[id]) }, 1000)

Use this to compute a 90% confidence interval:

 boot.ci(b, .9, type="norm")
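
To make the resampling step explicit, the same idea can be sketched in base R without the boot package. The simulated sample and the number of replications below are arbitrary, and the interval shown is a percentile interval rather than the normal approximation used above:

```r
set.seed(5)                                   # arbitrary seed
data <- rexp(200, rate = 1)                   # illustrative sample
# Resample indices with replacement and recompute the mean each time
boot_means <- replicate(1000, mean(data[sample(length(data), replace = TRUE)]))
sd(boot_means)                                # bootstrap standard error of the mean
quantile(boot_means, c(0.05, 0.95))           # 90% percentile interval
```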



Multiple Imputation

Multiple imputation of missing data generally includes two steps. First, an imputation step, which results in multiple complete datasets. Second, a pooling step, which combines the results obtained by applying the chosen technique to each separate dataset. The packages needed for these two steps are not necessarily the same, but can be.


Nonparametric Methods

This page deals with a set of non-parametric methods including the estimation of a cumulative distribution function (CDF), the estimation of probability density function (PDF) with histograms and kernel methods and the estimation of flexible regression models such as local regressions and generalized additive models.

For an introduction to nonparametric methods, you can have a look at the following books or handouts:

  • Nonparametric Econometrics: A Primer by Jeffrey S. Racine[1].
  • Li and Racine's handbook, Nonparametric Econometrics[2].
  • Larry Wasserman, All of Nonparametric Statistics[3].

Empirical distribution function

  • The simplest way to estimate the empirical CDF uses the rank() and the length() functions.
  • ecdf() computes the empirical cumulative distribution function.
  • ecdf.ksCI() (sfsmisc) plots the empirical distribution function with confidence intervals.
> N <- 1000
> x <- rnorm(N)
> edf <- rank(x)/length(x)
> plot(x,edf)
> plot(ecdf(x),xlab = "x",ylab = "Distribution of x")
> grid()
> library("sfsmisc")
> ecdf.ksCI(x)
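
Note that ecdf() returns a function, which can then be evaluated at any point. The short sketch below, on simulated data with an arbitrary seed, also checks that it agrees with the rank-based estimate when there are no ties:

```r
set.seed(6)                               # arbitrary seed
x <- rnorm(1000)
Fn <- ecdf(x)                             # ecdf() returns a step function
Fn(0)                                     # share of observations <= 0, close to 0.5
all.equal(Fn(x), rank(x) / length(x))     # TRUE when there are no ties
```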


Density Estimation

Histogram

  • hist() is the standard function for drawing histograms. If you store the histogram as an object, the break points, counts and densities are returned in this object.
> x <- rnorm(1000)
> hist(x, probability = T) # The default uses Sturges method.
> # Sturges, H. A. (1926) The choice of a class interval.
> # Journal of the American Statistical Association 21, 65–66. 
> hist(x, breaks = "Sturges", probability = T)
> 
> # Freedman, D. and Diaconis, P. (1981) On the histogram as a density estimator: L_2 theory.
> # Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57, 453–476. 
> # (n^1/3 * range)/(2 * IQR).
> hist(x, breaks = "FD", probability = T)
> 
> # Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66, 605–610. 
> # ceiling[n^1/3 * range/(3.5 * s)].
> hist(x, breaks = "scott", probability = T)
> 
> # Wand, M. P. (1995). Data-based choice of histogram binwidth.
> # The American Statistician, 51, 59–64. 
> library("KernSmooth")
> h <- dpih(x)
> bins <- seq(min(x)-h, max(x)+h, by=h)
> hist(x, breaks=bins, probability = T)

It is also possible to choose the break points.

> x <- rnorm(1000)
> hist(x, breaks = seq(-4,4,.1))
  • n.bins() (car package) includes several methods to compute the number of bins for a histogram.
  • histogram() (lattice)
  • truehist() (MASS)
  • hist.scott() (MASS) plots a histogram with automatic bin width selection, using the Scott or Freedman–Diaconis formulae.
  • histogram package.

Kernel Density Estimation

  • density() computes a kernel density estimate of a vector.
    • Choose the bandwidth selection method with bw.
    • Check the sensitivity to the bandwidth choice using adjust. The default is one. It is good practice to look at adjust=.5 and adjust=2.
> x <- rnorm(10^3)
> plot(density(x,bw = "nrd0", adjust = 1, kernel = "gaussian"), col = 1)
> lines(density(x,bw = "nrd0", adjust = .5, kernel = "gaussian"), col = 2)
> lines(density(x,bw = "nrd0", adjust = 2, kernel = "gaussian"), col = 3)
> legend("topright", legend = c("adjust = 1", "adjust = .5", "adjust = 2"), col = 1:3, lty = 1)
    • Choose the kernel function with kernel : "gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine".
> x <- rnorm(10^3)
> plot(density(x,bw = "nrd0", adjust = 1, kernel = "gaussian"), col = 1)
> lines(density(x,bw = "nrd0", adjust = 1, kernel = "epanechnikov"), col = 2)
> lines(density(x,bw = "nrd0", adjust = 1, kernel = "rectangular"), col = 3)
> lines(density(x,bw = "nrd0", adjust = 1, kernel = "triangular"), col = 4)
> legend("topright", legend = c("gaussian", "epanechnikov", "rectangular",  "triangular"), col = 1:4, lty = 1)
  • tkdensity() (sfsmisc) is a nice function which lets you choose the kernel and the bandwidth dynamically through a handy graphical user interface. This is a good way to check how sensitive the density estimate is to the choice of bandwidth and/or kernel.
> x  <- rnorm(10^3)
> library("sfsmisc")
> tkdensity(x)
  • kde2d() (MASS) estimates a bivariate kernel density.
> library("MASS")
> N <- 1000
> x <- rnorm(N)
> y <- 1 + x^2 + rnorm(N)
> dd <- kde2d(y, x) # estimate the bivariate kernel density (y on the horizontal axis)
> contour(dd) # contour plot of the bivariate density
> image(dd) # image plot of the same density
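
To make the role of the bandwidth and the kernel concrete, the univariate estimator can be written out by hand and compared with density(). This is only a sketch: the seed, the evaluation grid and the from/to limits are arbitrary choices, and density() uses a binned approximation, so the two columns agree closely rather than exactly.

```r
set.seed(1)                         # arbitrary seed
x    <- rnorm(500)
h    <- bw.nrd0(x)                  # the default bandwidth rule of density()
grid <- seq(-3, 3, by = 0.5)        # points at which the density is evaluated

# Kernel density estimate written out by hand:
# f(g) = 1/(n*h) * sum_i K((g - x_i)/h), with K the standard normal density
kde_manual <- sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)

# The same quantity from density(), interpolated at the grid points
d <- density(x, bw = h, kernel = "gaussian", from = -4, to = 4)
kde_builtin <- approx(d$x, d$y, xout = grid)$y

round(cbind(grid, kde_manual, kde_builtin), 4)
```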

Examples

Local Regression

  • loess() is the standard function for local polynomial regression (locally quadratic by default; use degree = 1 for a local linear fit).
  • lowess() is similar to loess() but does not have the standard formula syntax y ~ x. It is the ancestor of loess() (with different defaults!).
  • ksmooth() (stats) computes the Nadaraya–Watson kernel regression estimate.
  • locpoly() (KernSmooth package)
  • npreg() (np package)
  • locpol computes local polynomial estimators
  • locfit local regression, likelihood and density estimation
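
A minimal sketch of the three base R smoothers listed above, applied to the same simulated data. The data-generating process, the seed, and the span and bandwidth values are all assumptions made for illustration, not recommended settings.

```r
set.seed(1)                                   # arbitrary seed
n <- 200
x <- sort(runif(n, 0, 2 * pi))
y <- sin(x) + rnorm(n, sd = 0.3)

# loess(): formula interface; span is the share of points used in each local fit
fit_loess <- loess(y ~ x, span = 0.5, degree = 1)

# lowess(): takes x and y directly; f plays the role of span
fit_lowess <- lowess(x, y, f = 0.5)

# ksmooth(): Nadaraya-Watson estimator, here with a normal kernel
fit_ks <- ksmooth(x, y, kernel = "normal", bandwidth = 1)

# Correlation of each fit with the true regression function sin(x)
c(loess   = cor(predict(fit_loess), sin(x)),
  lowess  = cor(fit_lowess$y, sin(fit_lowess$x)),
  ksmooth = cor(fit_ks$y, sin(fit_ks$x)))
```

Note the difference in interfaces: loess() returns a model object that works with predict(), whereas lowess() and ksmooth() return plain lists with x and y components that can be passed to lines().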


Examples

Generalized additive semiparametric models (GAM)

  • gam() (gam)
  • gam() (mgcv)
> N <- 10^3
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1
> y <- 1 + x1^2 + x2^3 + u
> 
> library(gam)
> g1 <- gam(y ~ x1 + x2 ) # Standard linear model
> par(mfrow=c(1,2))
> plot(g1, se = T)
> 
> g1 <- gam(y ~ s(x1) + x2 ) # x1 is locally estimated
> par(mfrow=c(1,2))
> plot(g1, se = T)
> 
> g1 <- gam(y ~ s(x1) + s(x2) ) # x1 and x2 are locally estimated
> par(mfrow=c(1,2))
> plot(g1, se = T)
> 
> library(mgcv) # masks gam() and s() from the gam package loaded above
> g1 <- gam(y ~ s(x1) + s(x2) ) # x1 and x2 are locally estimated
> par(mfrow=c(1,2))
> plot(g1, se = TRUE)


References

  1. Jeffrey S. Racine Nonparametric Econometrics: A Primer http://socserv.mcmaster.ca/racine/ECO0301.pdf and at the R code examples http://socserv.mcmaster.ca/racine/primer_code.zip
  2. Qi Li, Jeffrey S. Racine, Nonparametric econometrics, Princeton University Press - 2007
  3. Wasserman, Larry, "All of nonparametric statistics", Springer (2007) (ISBN: 0387251456)




Linear Models

Standard linear model

In this section we present estimation functions for the standard linear model estimated by ordinary least squares (OLS). Heteroskedasticity and endogeneity are treated below. The main estimation function is lm().

Fake data simulations

We first generate a fake dataset with no heteroskedasticity, no endogeneity and no correlation between the error terms, so that the ordinary least squares estimator is unbiased and efficient. We choose a model with two explanatory variables and set all the coefficients equal to one.

> N <- 1000
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- 1 + x1 + rnorm(N)
> y <- 1 + x1 + x2 + u
> df <- data.frame(y,x1,x2)

Least squares estimation

  • The standard function to estimate a simple linear model is lm().
  • lsfit() performs the least squares procedure, but its output is not formatted as conveniently.
  • ols() (rms, the successor of the Design package) is another alternative.

We estimate the model using lm(). We store the results in fit and print them using summary(), which is the standard function.

> fit <- lm(y ~ x1 + x2, data = df)
> summary(fit)
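
To see what lm() computes, the coefficients can be reproduced by solving the normal equations directly. This is a sketch for illustration only (lm() itself uses a QR decomposition, which is numerically more stable); the seed is an arbitrary choice.

```r
set.seed(1)                          # arbitrary seed
N  <- 1000
x1 <- rnorm(N)
x2 <- 1 + x1 + rnorm(N)
y  <- 1 + x1 + x2 + rnorm(N)

fit <- lm(y ~ x1 + x2)

# OLS by hand: solve the normal equations (X'X) b = X'y
X <- cbind("(Intercept)" = 1, x1 = x1, x2 = x2)
beta_hat <- solve(crossprod(X), crossprod(X, y))

cbind(manual = drop(beta_hat), lm = coef(fit))
```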

There are some alternatives for displaying the results.

  • display() in the arm package is one of them.
  • coefplot() (arm) graphs the estimated coefficients with confidence intervals. This is a good way to present the results.
  • mtable() in the memisc package can display the results of a set of regressions in the same table.
> library("arm")
> display(fit)
> coefplot(fit)

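fit is a list of objects. You can see the names of its components by typing names(fit). We can also apply functions to fit.

We can get the estimated coefficients using fit$coeff (short for fit$coefficients) or coef(fit). Both return the same vector. The numerical outputs printed in this section are illustrative and come from different simulated samples, so they do not agree with one another.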

> fit$coeff
(Intercept)          x1          x2 
  1.2026522   0.8427403   1.5146775
> coef(fit)
(Intercept)          x1          x2 
     0.7541      1.7844      0.7222 
> output <- summary(fit)
> coef(output) 
             Estimate Std. Error  t value    Pr(>|t|)
(Intercept) 1.1945847  0.2298888 5.196359 0.001258035
x1          0.6458170  0.3423214 1.886581 0.101182585
x2          0.6175165  0.2083628 2.963660 0.020995713

se.coef() (arm) returns the standard error of the estimated coefficients.

The vector of fitted values can be returned via fit$fitted, fitted(fit) or the predict() function. The predict() function also returns standard error and confidence intervals for predictions.

 
> fit$fitted
> fitted(fit)

The vector of residuals:

> fit$resid
> residuals(fit)

The residual degrees of freedom:

> fit$df.residual
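
The predict() function mentioned above can be sketched as follows. The new covariate values in new_obs are hypothetical, and the seed is arbitrary.

```r
set.seed(1)                          # arbitrary seed
N  <- 1000
x1 <- rnorm(N)
x2 <- 1 + x1 + rnorm(N)
y  <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y, x1, x2)
fit <- lm(y ~ x1 + x2, data = df)

new_obs <- data.frame(x1 = c(0, 1), x2 = c(1, 2))   # hypothetical covariate values

# Confidence interval for the conditional mean of y
conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
# Prediction interval for a new observation (wider: it adds the error variance)
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)
# Standard errors of the fitted means
se_fit <- predict(fit, newdata = new_obs, se.fit = TRUE)$se.fit

conf_int
pred_int
```

Fitted values and residuals add up to the observed outcome, and with three estimated coefficients the residual degrees of freedom are N - 3.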

Confidence intervals

We can get the confidence intervals using confint() or conf.intervals() in the alr3 package.

> confint(fit, level = .9)
                   5 %     95 %
(Intercept) -0.7263261 1.200079
x1          -0.5724022 1.909924
x2           0.6185011 2.475079
> confint(fit, level = .95)
                 2.5 %   97.5 %
(Intercept) -0.9652970 1.439050
x1          -0.8803353 2.217858
x2           0.3881923 2.705388
> confint(fit, level = .99)
                 0.5 %   99.5 %
(Intercept) -1.5422587 2.016012
x1          -1.6237963 2.961319
x2          -0.1678559 3.261436
> library(alr3)
> conf.intervals(fit)
                 2.5 %   97.5 %
(Intercept) -0.9652970 1.439050
x1          -0.8803353 2.217858
x2           0.3881923 2.705388
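
For a linear model, the interval returned by confint() is the estimate plus or minus a Student t quantile times the standard error. A short sketch that rebuilds it by hand (sample size and seed are arbitrary choices):

```r
set.seed(1)                          # arbitrary seed
N  <- 100
x1 <- rnorm(N)
x2 <- 1 + x1 + rnorm(N)
y  <- 1 + x1 + x2 + rnorm(N)
fit <- lm(y ~ x1 + x2)

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))                 # standard errors
tq  <- qt(0.975, df = df.residual(fit))      # t quantile for a 95% interval

manual <- cbind(lower = est - tq * se, upper = est + tq * se)
manual
confint(fit, level = 0.95)
```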

Tests

coeftest() (lmtest) performs the Student t test and z test on coefficients.

> library("lmtest")
> coeftest(fit) # t-test
> coeftest(fit,df=Inf) # z-test (for large samples)

linearHypothesis() (car; called linear.hypothesis() in older versions of the package) performs a finite sample F test on a linear hypothesis or an asymptotic Wald test using a chi-squared statistic.

> library("car")
> linearHypothesis(fit,"x1 = x2") # tests Beta1 = Beta2
> linearHypothesis(fit,c("(Intercept)", "x1","x2"),rep(1,3)) # Tests  Beta0 = Beta1 = Beta2 = 1
> linearHypothesis(fit,c("(Intercept)", "x1","x2"),rep(0,3)) # Tests  Beta0 = Beta1 = Beta2 = 0
> linearHypothesis(fit,c("x1","x2"),rep(0,2)) # Tests Beta1 = Beta2 = 0

See also waldtest() (lmtest) for nested models.
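
A linear restriction such as Beta1 = Beta2 can also be tested with base R alone by fitting the restricted model and comparing it with the unrestricted one. The following is a sketch of that idea, not the car implementation; the seed is arbitrary.

```r
set.seed(1)                       # arbitrary seed
N  <- 1000
x1 <- rnorm(N)
x2 <- 1 + x1 + rnorm(N)
y  <- 1 + x1 + x2 + rnorm(N)

fit_u <- lm(y ~ x1 + x2)          # unrestricted model
fit_r <- lm(y ~ I(x1 + x2))       # restricted model: imposes Beta1 = Beta2

# F test of the restriction
ftest <- anova(fit_r, fit_u)
ftest

# The same F statistic from the two residual sums of squares
rss_r <- sum(residuals(fit_r)^2)
rss_u <- sum(residuals(fit_u)^2)
f_manual <- ((rss_r - rss_u) / 1) / (rss_u / df.residual(fit_u))
f_manual
```

Because the restriction is true in this simulation, the test should reject in only about 5% of samples at the 5% level.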

Analysis of variance

We can also make an analysis of variance using anova().

> anova(fit)

Model Search and information criteria

> # Akaike Information Criterion
> AIC(fit)
[1] 26.72857
> # Bayesian Information Criterion
> AIC(fit,k=log(N))
[1] 27.93891

The stats4 package includes AIC() and BIC() functions:

> library(stats4)
> ?BIC
> lm1 <- lm(Fertility ~ . , data = swiss)
> AIC(lm1)
[1] 326.0716
> BIC(lm1)
[1] 339.0226
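
Both criteria are simple functions of the maximised log-likelihood. A small sketch that recomputes them by hand (simulated data and seed are arbitrary):

```r
set.seed(1)                        # arbitrary seed
N <- 200
x <- rnorm(N)
y <- 1 + x + rnorm(N)
fit <- lm(y ~ x)

ll <- logLik(fit)
k  <- attr(ll, "df")               # number of estimated parameters (includes the error variance)

aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(nobs(fit)) * k

c(AIC = aic_manual, BIC = bic_manual)
```

This also shows why AIC(fit, k = log(N)) returns the BIC: the only difference between the two criteria is the penalty per parameter.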

The step() function performs a model search using the Akaike Information Criterion.

> N <- 10^3
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1
> x3 <- rnorm(N)
> y <- 1+ x1 + x2 + u
> fit <- lm(y~x1+x2 + x3)
> step.fit <- step(fit)

Zelig

  • The method is also supported in Zelig
> library("Zelig")
> N <- 1000
> u <- rnorm(N)
> x <- rnorm(N)
> y <- 1 + x + u
> mydat <- data.frame(y,x)
> z.out <- zelig(y ~  x, model = "ls", data = mydat)
> x.out <- setx(z.out, x = 10)
> s.out <- sim(z.out, x.out)
> summary(s.out)

Bayesian estimation

  • MCMCregress() (MCMCpack)
  • BLR() (BLR)
> library("MCMCpack")
> N <- 1000
> u <- rnorm(N)
> x <- rnorm(N)
> y <- 1 + x + u
> mydat <- data.frame(y,x)
> 
> posterior <- MCMCregress(y ~ x, data = mydat)
> summary(posterior)
> plot(posterior)

Heteroskedasticity

  • See the lmtest and sandwich packages.
  • gls() (nlme) computes the generalized least squares estimator.
  • See "Cluster-robust standard errors using R" (pdf) by Mahmood Arai. He suggests two functions for cluster-robust standard errors: clx() allows for one-way clustering and mclx() for two-way clustering. They can be loaded with the command source("http://people.su.se/~ma/clmclx.R").
> N <- 10 # 10 people
> T <- 5 # 5 times
> id <- rep(1:N,T)
> f <- rep(rnorm(N),T) # is individual specific
> u <- rnorm(N*T)
> x1 <- rnorm(N*T) 
> x2 <- rnorm(N*T) + x1
> y <- 1 + x1 + x2 + f + u
> fit <- lm(y ~ x1 + x2 )
> source("http://people.su.se/~ma/clmclx.R")
> clx(fit, 1, id)
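
For plain (non-clustered) heteroskedasticity, the idea behind robust standard errors can be sketched in base R with White's HC0 "sandwich" formula. The simulated variance function and the seed are assumptions made for this illustration; in practice the sandwich package mentioned above provides ready-made estimators such as vcovHC().

```r
set.seed(1)                              # arbitrary seed
N <- 500
x <- rnorm(N)
u <- rnorm(N, sd = exp(0.5 * x))         # error variance depends on x
y <- 1 + x + u
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)

bread  <- solve(crossprod(X))            # (X'X)^{-1}
meat   <- crossprod(X * e)               # X' diag(e^2) X
vc_hc0 <- bread %*% meat %*% bread       # HC0 covariance matrix

cbind(classical = sqrt(diag(vcov(fit))),
      HC0       = sqrt(diag(vc_hc0)))
```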

Robustness

Cook's distance (cooks.distance() is part of the stats package; older versions of car provided it as cookd()):

> library(car)
> cooks.distance(fit)
           1            2            3            4            5 
0.0006205008 0.0643213760 0.2574810866 1.2128206779 0.2295047699 
           6            7            8            9           10 
0.3130578329 0.0003365221 0.0671830241 0.0048474954 0.0714255871

Influence plot:

> influencePlot(fit) # influence.plot() in older versions of car

Leverage plots:

> leveragePlot(fit, term.name = "x1") # leverage.plot() in older versions of car
> leveragePlot(fit, term.name = "x2")

Bonferroni's outlier test:

> outlierTest(fit) # outlier.test() in older versions of car; the output below is in the older format

max|rstudent| = 2.907674, degrees of freedom = 6,
unadjusted p = 0.02706231, Bonferroni p = 0.2706231

Observation: 3

See also outlier.t.test() in the alr3 package.

  • inf.index() in the alr3 package computes all the robustness statistics (Cook's distance, studentized residuals, outlier test, etc.)
  • rlm() (MASS) performs a robust estimation
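
Several of these diagnostics are also available in base R (the stats package). The sketch below contaminates one observation on purpose; the sample size, the seed and the size of the shift are arbitrary choices.

```r
set.seed(1)                      # arbitrary seed
N  <- 50
x1 <- rnorm(N)
x2 <- rnorm(N) + x1
y  <- 1 + x1 + x2 + rnorm(N)
y[10] <- y[10] + 8               # contaminate observation 10 on purpose
fit <- lm(y ~ x1 + x2)

cd  <- cooks.distance(fit)       # Cook's distance
lev <- hatvalues(fit)            # leverage (diagonal of the hat matrix)
rs  <- rstudent(fit)             # externally studentized residuals

which.max(abs(rs))               # the contaminated observation stands out
sum(lev)                         # equals the number of coefficients (3)
head(sort(cd, decreasing = TRUE), 3)
```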

Instrumental Variables

  • ivreg() in the AER package[1]
  • tsls() in the sem package.
  • It is also possible to use the gmm() command in the gmm package. See Methods of moments for an example.

Fake data simulations

We first simulate a fake data set in which x is correlated with u, z is independent of u, and x is correlated with z. Thus x is an endogenous explanatory variable of y and z is a valid instrument for x.

> N <- 1000
> z <- rnorm(N)
> u <- rnorm(N) 
> x <- 1 + z + u + rnorm(N) # x is correlated with the error term u (endogeneity) and the instrument z
> y <- 1 + x + u

Two stage least squares

Then we estimate the model with OLS (lm()) and with IV, using z as an instrument for x.

> ols <- lm(y ~ x)
> summary(ols) # OLS estimates are biased
> library("AER")
> iv <- ivreg(y ~ x | z)
> summary(iv) # IV estimates are consistent
> library("sem")
> iv2 <- tsls(y  ~ x, instruments = ~ z)
> summary(iv2)
> library("gmm")
> iv3 <- gmm(y ~ x, z)
> summary(iv3)

We plot the results:

> plot(y ~ x, col = "gray")
> abline(a  = 1,b = 1, lty = 1, col = 1, lwd = 2)
> abline(ols,  lty = 2, col = 2 , lwd = 2)
> abline(iv, lty = 3, col = 3, lwd = 2)
> legend("topleft", legend = c("True values","OLS","IV"), col = 1:3, lwd = rep(2,3), lty = 1:3)
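
The name "two stage least squares" can be taken literally. The sketch below reproduces the IV point estimate with two calls to lm(); it is meant only to show the mechanics, since the standard errors reported for the second stage are not valid (use ivreg() or tsls() for inference). The seed is arbitrary.

```r
set.seed(1)                        # arbitrary seed
N <- 1000
z <- rnorm(N)
u <- rnorm(N)
x <- 1 + z + u + rnorm(N)          # endogenous regressor
y <- 1 + x + u

# First stage: project x on the instrument
x_hat <- fitted(lm(x ~ z))
# Second stage: regress y on the first-stage fitted values
stage2 <- lm(y ~ x_hat)

b_ols   <- coef(lm(y ~ x))[["x"]]
b_2sls  <- coef(stage2)[["x_hat"]]
b_ratio <- cov(z, y) / cov(z, x)   # equivalent ratio form with a single instrument

c(OLS = b_ols, TSLS = b_2sls, ratio = b_ratio)
```

The OLS slope is pulled away from the true value of 1 by the correlation between x and u, while the two IV computations coincide and are close to 1.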

Panel Data

plm() (plm) implements the standard random effects, fixed effects and first-difference methods[2]. It is similar to Stata's xtreg command.

Note that plm output is not compatible with xtable() and mtable() for publication quality output.

  • lme4 and gee implement random effects and multilevel models.
  • See also BayesPanel

Random effects model

To implement a random effects model we generate a fake data set with 1000 individuals observed over 5 time periods.

> N <- 1000
> T <- 5
> library(mvtnorm)
> sig <- diag(rep(1,T))
> r <- rmvnorm(N, sigma = sig)
> wide <- data.frame(id = 1:N,f = rnorm(N), u = r)
> long <- reshape(wide, varying = list(3:7), v.names = "u", direction = "long", timevar = "year")
> long$x1 <- 1 + rnorm(N*T) 
> long$x2 <- 1 + rnorm(N*T) + long$x1
> long$y <- 1 + long$x1 + long$x2 + long$f + long$u
> head(long[order(long$id),])

We estimate the random effects model with the plm() function and the model = "random" option.

> library("plm")
> panel <- pdata.frame(long, index = c("id","year"))
> # older versions of plm used: panel <- plm.data(long, index = c("id","year"))
> eq <- y ~ x1 + x2
> re <- plm(eq, model = "random", data=panel)
> summary(re)

Fixed effects model

For a fixed effects model we generate a fake dataset in which the fixed effects f are correlated with the covariates:

> N <- 1000
> T <- 5
> library(mvtnorm)
> sig <- diag(rep(1,T))
> r <- rmvnorm(N, sigma = sig)
> wide <- data.frame(id = 1:N,f = rnorm(N), u = r)
> long <- reshape(wide, varying = list(3:7), v.names = "u", direction = "long", timevar = "year")
> long$x1 <- 1 + rnorm(N*T) + long$f
> long$x2 <- 1 + rnorm(N*T) + long$x1
> long$y <- 1 + long$x1 + long$x2 + long$f + long$u
> head(long[order(long$id),])

We first convert our data to a panel data frame using pdata.frame(). We estimate the fixed effects model using plm() with model = "within" as an option. Then we compare the estimates with those of the random effects model and perform a Hausman test. Finally, we plot the density of the fixed effects.

> library("plm")
> panel <- pdata.frame(long, index = c("id","year"))
> # older versions of plm used: panel <- plm.data(long, index = c("id","year"))
> eq <- y ~ x1 + x2
> fe <- plm(eq, model = "within", data=panel)
> summary(fe)
> re <- plm(eq, model = "random", data=panel)
> summary(re)
> phtest(fe, re)
> plot(density(fixef(fe)))
> rug(fixef(fe))
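
The "within" estimator has a simple interpretation: subtract each individual's mean from every variable and run OLS on the demeaned data. The base R sketch below (smaller panel and arbitrary seed, chosen for speed) checks that this gives the same slopes as a regression with one dummy variable per individual. The standard errors of the demeaned regression are not adjusted for the estimated individual means, so use plm() for inference.

```r
set.seed(1)                                 # arbitrary seed
n_id <- 50                                  # individuals
n_t  <- 5                                   # time periods
id <- rep(1:n_id, each = n_t)
f  <- rep(rnorm(n_id), each = n_t)          # individual effect
x1 <- 1 + rnorm(n_id * n_t) + f             # correlated with the effect
x2 <- 1 + rnorm(n_id * n_t) + x1
y  <- 1 + x1 + x2 + f + rnorm(n_id * n_t)

# Within transformation: remove individual means
demean <- function(v) v - ave(v, id)
within_fit <- lm(demean(y) ~ 0 + demean(x1) + demean(x2))

# Least squares dummy variables (LSDV): one intercept per individual
lsdv_fit <- lm(y ~ x1 + x2 + factor(id))

rbind(within = unname(coef(within_fit)),
      lsdv   = unname(coef(lsdv_fit)[c("x1", "x2")]))
```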

Dynamic panel data

  • pgmm() (plm) implements the Arellano Bond estimation procedure[3]. It is similar to xtabond2 in Stata[4].

Simultaneous equations model

For a simultaneous equations model, the following packages are useful:

  • sem package
  • systemfit package

References

  1. Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL http://CRAN.R-project.org/package=AER
  2. Yves Croissant, Giovanni Millo (2008). Panel Data Econometrics in R: The plm Package. Journal of Statistical Software 27(2). URL http://www.jstatsoft.org/v27/i02/.
  3. M Arellano, S Bond "Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations" - The Review of Economic Studies, 1991
  4. David Roodman, XTABOND2: Stata module to extend xtabond dynamic panel data estimator, http://ideas.repec.org/c/boc/bocode/s435901.html

External links



Quantile Regression

Quantile regression is an old method that has become popular only in recent years thanks to progress in computing. One of the main researchers in this area is also an R practitioner and has developed a dedicated package for quantile regression (quantreg)[1][2].

In theory, quantile regression models are also linear and could therefore have been included in the Linear Models page. However, this is a specialised topic and we think that it deserves a page of its own.

Simple quantile model

We simulate from a simple quantile model. We first generate a uniform error term u and a covariate x.

N <- 10^3
u <- runif(N)
x <- 1 + rnorm(N)
y <- qnorm(u, mean = 0, sd = 2) + qnorm(u, mean = 1, sd = 1) * x

We estimate the quantile model for some values of tau (the quantile) and plot the coefficients:

library(quantreg)
q1 <- rq(y ~ x, tau = seq(.1,.9,.1))
summary(q1)
plot(q1)

We then plot the scatterplot, the predicted values using a standard linear model and the predicted values using a quantile linear model:

plot(x,y, col = "grey")
m1 <- lm(y ~ x)
abline(m1, col = "red")
taus <- seq(.1,.9,.1)
for (i in seq_along(taus)){
	abline(rq(y ~ x, tau = taus[i]), col = "blue")
	}
grid()

We can also estimate the model for all quantiles at the same time:

q2 <- rq(y ~ x, tau = -1)
plot(q2, nrow = 2, ncol = 1)
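
Quantile regression minimises an asymmetric absolute loss, often called the check function. The intercept-only case can be sketched in base R: minimising the check loss over a constant recovers the sample quantile. This is an illustration of the objective function only, not the algorithm used by rq(); the seed and sample size are arbitrary.

```r
set.seed(1)                        # arbitrary seed
y <- rnorm(1001)

# Check function: rho_tau(r) = r * (tau - 1{r < 0})
check_loss <- function(r, tau) r * (tau - (r < 0))

# Total loss when every observation is predicted by the constant q
objective <- function(q, tau) sum(check_loss(y - q, tau))

taus  <- c(0.1, 0.5, 0.9)
q_hat <- sapply(taus, function(tau)
  optimize(objective, interval = range(y), tau = tau)$minimum)

cbind(tau = taus, check_loss = q_hat, quantile = quantile(y, taus))
```

With covariates, rq() minimises the same loss over the coefficients of a linear predictor instead of a single constant.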

Computing time

For large data sets it is better to use the "fn" or "pfn" method.

> N <- 10^5
> u <- runif(N)
> x <- 1 + rnorm(N)
> y <- qnorm(u, mean = 0, sd = 2) + qnorm(u, mean = 1, sd = 1) * x
> system.time(rq(y ~ x, tau = .5, method = "br"))
   user  system elapsed 
   1.48    0.00    1.48 
> system.time(rq(y ~ x, tau = .5, method = "fn"))
   user  system elapsed 
   0.60    0.00    0.61 
> system.time(rq(y ~ x, tau = .5, method = "pfn")) 
   user  system elapsed 
   0.30    0.00    0.29

Resources

References

  1. Roger Koenker (2010). quantreg: Quantile Regression. R package version 4.50. http://CRAN.R-project.org/package=quantreg
  2. Roger Koenker's personal webpage


Binomial Models

In this section, we look at the binomial model. We have one outcome which is binary and a set of explanatory variables.

This kind of model can be analyzed using a linear probability model. However, a drawback of this model for the parameter $p$ of the Bernoulli distribution is that, unless restrictions are placed on $\beta$, the estimated coefficients can imply probabilities outside the unit interval $[0,1]$. For this reason, models such as the logit model or the probit model are more commonly used. If you want to estimate a linear probability model, have a look at the linear models page.

Logit model

The model takes the form: $y_i \sim \text{Bernoulli}(p_i)$, with the inverse link function: $p_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}$. It can be estimated using maximum likelihood or using Bayesian methods.

Fake data simulations

> x <- 1 + rnorm(1000,1) 
> xbeta <- -1  + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
> df <- data.frame(y,x)

Maximum likelihood estimation

  • The standard way to estimate a logit model is the glm() function with family = binomial and link = logit.
  • lrm() (rms, the successor of the Design package) is another implementation of the logistic regression model.
  • There is an implementation in the Zelig package[1].

In this example, we use the data simulated above, with one continuous predictor, and estimate the model using the glm() function.

> res <- glm(y ~ x , family  = binomial(link=logit))
> summary(res) # results
> confint(res) # confidence intervals
> names(res) 
> exp(res$coefficients) # odds ratios
> exp(confint(res)) # confidence intervals for the odds ratios (exponentiated interval endpoints)
> predict(res) # prediction on a linear scale
> predict(res, type = "response") # predicted probabilities
> plot(x, predict(res, type = "response")) # plot the predicted probabilities
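
To make "maximum likelihood" concrete, the logit log-likelihood can be written down and maximised with optim(), then compared with glm(). This is a sketch under arbitrary choices (seed, starting values, BFGS with a tight tolerance); glm() itself uses iteratively reweighted least squares.

```r
set.seed(1)                                # arbitrary seed
N <- 1000
x <- 1 + rnorm(N, 1)
y <- rbinom(N, 1, plogis(-1 + x))          # plogis() is the logistic inverse link

# Negative log-likelihood of the logit model
negll <- function(b) {
  eta <- b[1] + b[2] * x
  -sum(y * eta - log1p(exp(eta)))          # log1p(exp(eta)) = log(1 + exp(eta))
}
# Its gradient: minus X'(y - p)
grad <- function(b) {
  p <- plogis(b[1] + b[2] * x)
  -c(sum(y - p), sum((y - p) * x))
}

opt <- optim(c(0, 0), negll, grad, method = "BFGS",
             control = list(reltol = 1e-12))
res <- glm(y ~ x, family = binomial(link = "logit"))

cbind(optim = opt$par, glm = coef(res))
```

The predicted probabilities returned by predict(res, type = "response") are simply the inverse link applied to the linear predictor, that is plogis(predict(res)).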

Zelig

The Zelig package makes it easy to compute all the quantities of interest.

We develop a new example. First we simulate a new dataset with two continuous explanatory variables and we estimate the model using zelig() with the model = "logit" option.

  • We then look at the predicted values of y at the mean of x1 and x2.
  • Then we look at the predicted values when x1 = 0 and x2 = 0.
  • We also look at what happens when x1 changes from the 3rd to the 1st quartile.
> library("Zelig")
> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1  + x1 + x2
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)
> 
> z.out <- zelig(y ~ x1 + x2, model = "logit", data = mydat) # estimating the model
> summary(z.out)
> x.out <- setx(z.out, x1 = mean(x1), x2 = mean(x2)) # setting values for the explanatory variables
> s.out <- sim(z.out, x = x.out) # simulating the quantities of interest
> summary(s.out)
> plot(s.out) # plot the quantities of interest

> # the same with other values
> x.out <- setx(z.out, x1 = 0, x2 = 0)
> s.out <- sim(z.out, x = x.out)
> summary(s.out)

> # What happens if x1 changes from the 3rd quartile to the 1st quartile?
> x.high <- setx(z.out, x1 = quantile(mydat$x1,.75), x2 = mean(mydat$x2)) 
> x.low <- setx(z.out, x1 = quantile(mydat$x1,.25), x2 = mean(mydat$x2)) 
> s.out2<-sim(z.out, x=x.high, x1=x.low) 
> plot(s.out2)
  • ROC Curve in the verification package.
  • Zelig has a rocplot() function.

Bayesian estimation

  • bayesglm() in the arm package
  • MCMClogit() in the MCMCpack package for a Bayesian estimation of the logit model.
> # Data generating process
> x <- 1 + rnorm(1000,1) 
> xbeta <- -1  + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
> 
> library(MCMCpack)
> res <- MCMClogit(y ~ x)
> summary(res)

> library("arm")
> res <- bayesglm(y ~ x, family = binomial(link=logit))
> summary(res)

Probit model

The probit model is a binary model in which we assume that the inverse link function is the cumulative distribution function of the standard normal distribution.

We simulate fake data. First, we draw two random variables x1 and x2 from arbitrary distributions (the choice does not matter). Then we create the vector xbeta as a linear combination of x1 and x2. We apply the inverse link function to that vector and draw the binary variable y as a Bernoulli random variable.

> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1  + x1 + x2
> proba <- pnorm(xbeta)
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)

Maximum likelihood

We can use the glm() function with the family = binomial(link = probit) option, or the probit() function in the sampleSelection package, which is a wrapper of the former.

> res <- glm(y ~ x1 + x2 , family = binomial(link=probit), data = mydat)
> summary(res)
> 
> library("sampleSelection")
> res2 <- probit(y ~ x1 + x2, data = mydat)
> summary(res2)

Bayesian estimation

  • MCMCprobit() (MCMCpack)
> library("MCMCpack")
> post <- MCMCprobit(y ~ x1 + x2 , data = mydat)
> summary(post)
> plot(post)

See Also

  • There is an example of a probit model with R on the UCLA statistical computing website[2].

Semi-Parametric models

References

  1. Kosuke Imai, Gary King, and Olivia Lau. 2008. "logit: Logistic Regression for Dichotomous Dependent Variables" in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software," http://gking.harvard.edu/zelig
  2. UCLA statistical computing probit example http://www.ats.ucla.edu/stat/R/dae/probit.htm
  3. Klein, R. W. and R. H. Spady (1993), “An efficient semiparametric estimator for binary response models,” Econometrica, 61, 387-421.
  4. Tristen Hayfield and Jeffrey S. Racine (2008). Nonparametric Econometrics: The np Package. Journal of Statistical Software 27(5). URL http://www.jstatsoft.org/v27/i05/.


Multinomial Models

Multinomial Logit

  • mlogit package.
  • mnlogit package
  • Bayesm package
  • multinom() nnet
  • multinomial(), which is used by vglm() VGAM

Conditional Logit

  • clogit() in the survival package
  • mclogit package.


Multinomial Probit

  • mprobit package [1]
  • MNP package to fit a multinomial probit.


Multinomial ordered logit model

We consider a multinomial ordered logit model with unknown thresholds. First, we simulate fake data. We draw the residuals from a logistic distribution. Then we draw an explanatory variable x and define the latent variable ys as a linear function of x. Note that we set the constant to 0 because the constant and the thresholds cannot be identified simultaneously in this model, so one of the parameters has to be fixed. Then we define the thresholds (-1,0,1) and construct the observed variable y using the cut() function. So y is an ordered multinomial variable.

N <- 10000
u <- rlogis(N)
x <- rnorm(N)
ys <- x + u
mu <- c(-Inf,-1,0,1, Inf)
y <- cut(ys, mu)
plot(y,ys)
df <- data.frame(y,x)


Maximum likelihood estimation

This model can be estimated by maximum likelihood using the polr() function in the MASS package. Since it is not possible to achieve identification of the constant and the thresholds, R assumes by default that the constant is equal to 0.

library(MASS)
fit <- polr(y  ~ x, method = "logistic", data = df)
summary(fit)
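
The estimated thresholds and slope translate into category probabilities through P(y <= k | x) = plogis(zeta_k - x * beta), which is the parameterisation used by polr(). The sketch below rebuilds the probabilities by hand and compares them with predict(); the seed, the sample size and the evaluation points new_x are arbitrary choices.

```r
library(MASS)
set.seed(1)                                     # arbitrary seed
N  <- 2000
x  <- rnorm(N)
ys <- x + rlogis(N)                             # latent variable
y  <- cut(ys, c(-Inf, -1, 0, 1, Inf))           # observed ordered outcome
dat <- data.frame(y, x)

fit <- polr(y ~ x, data = dat, method = "logistic")

new_x <- c(-1, 0, 1)                            # hypothetical covariate values
eta   <- coef(fit)[["x"]] * new_x               # linear predictor (no constant)

# Cumulative probabilities P(y <= k | x): one row per x, one column per threshold
cum <- plogis(outer(-eta, fit$zeta, "+"))
# Category probabilities are differences of consecutive cumulative probabilities
probs_manual <- cbind(cum, 1) - cbind(0, cum)

probs_polr <- predict(fit, newdata = data.frame(x = new_x), type = "probs")
round(probs_polr, 3)
```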


Bayesian estimation

  • bayespolr() (arm) performs a Bayesian estimation of the multinomial ordered logit
library("arm")
fit <- bayespolr(y ~ x, method = "logistic", data = df)
summary(fit)

Multinomial ordered probit model

We generate fake data by drawing an error term from a normal distribution and cutting the latent variable into 4 categories.

N <- 1000
u <- rnorm(N)
x <- rnorm(N)
ys <- x + u
mu <- c(-Inf,-1,0,1, Inf)
y <- cut(ys, mu)
plot(y,ys)
df <- data.frame(x,y)


Maximum likelihood estimation

The model can be fitted by maximum likelihood. This can be done using the polr() function in the MASS package with the probit method.

library(MASS)
fit <- polr(y  ~ x, method = "probit", data = df)
summary(fit)


Bayesian estimation

  • bayespolr() (arm) performs a Bayesian estimation of the multinomial ordered probit


Rank Ordered Logit Model

This model was introduced in econometrics by Beggs, Cardell and Hausman in 1981.[2][3] One application is the paper by Combes et al. explaining the ranking of candidates to become professor.[3] It is also known as the Plackett–Luce model in the biomedical literature or as the exploded logit model in marketing.[3]

Conditionally Ordered Hierarchical Probit

  • The Conditionally Ordered Hierarchical Probit can be estimated using the anchors package developed by Gary King and his coauthors[4].

References

  1. Harry Joe, Laing Wei Chou and Hongbin Zhang (2006). mprobit: Multivariate probit model for binary/ordinal response. R package version 0.9-2.
  2. Beggs, S; Cardell, S; Hausman, J (1981). "Assessing the potential demand for electric cars". Journal of Econometrics. 17: 1–19. doi:10.1016/0304-4076(81)90056-7.
  3. a b c Combes, Pierre-Philippe; Linnemer, Laurent; Visser, Michael (2008). "Publish or peer-rich? The role of skills and networks in hiring economics professors". Labour Economics. 15 (3): 423–41. doi:10.1016/j.labeco.2007.04.003.
  4. Jonathan Wand, Gary King, Olivia Lau (2009). anchors: Software for Anchoring Vignette Data. Journal of Statistical Software, Forthcoming. URL http://www.jstatsoft.org/.


Tobit And Selection Models

Tobit (type 1 Tobit)

In this section, we look at a simple tobit model where the outcome variable is observed only if it is above or below a given threshold.

  • tobit() in the AER package[1]. This is a wrapper for survreg().
N <- 1000
u <- rnorm(N)
x <- - 1 + rnorm(N)
ystar <- 1 + x + u
y <- ystar*(ystar > 0)
hist(y)

ols <- lm(y ~ x)
summary(ols)

# Plot a correlation matrix and a scatter plot matrix
library(GGally)
library(ggplot2)
library(ggfortify)
DATA <- data.frame(y, x)
ggcorr(DATA)
ggpairs(DATA)

# Diagnostic plots for a linear model of y on all other variables
M <- lm(y ~ ., data = DATA)
autoplot(M, label.size = 3)

library(AER)
tobit.fit <- tobit(y ~ x, left = 0, right = Inf, dist = "gaussian")
summary(tobit.fit)
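
The tobit likelihood combines a normal density for the uncensored observations with a normal probability for the censored ones. A base R sketch that maximises it with optim() is given below; it illustrates the likelihood and is not the AER implementation. The seed, the starting values and the log-sigma parameterisation are choices made here.

```r
set.seed(1)                              # arbitrary seed
N <- 1000
x <- -1 + rnorm(N)
ystar <- 1 + x + rnorm(N)                # latent outcome
y <- pmax(ystar, 0)                      # left-censored at 0
cens <- y == 0

# Negative log-likelihood; par = (intercept, slope, log(sigma))
negll <- function(par) {
  xb <- par[1] + par[2] * x
  s  <- exp(par[3])
  ll_obs  <- dnorm(y[!cens], mean = xb[!cens], sd = s, log = TRUE)   # density part
  ll_cens <- pnorm(0, mean = xb[cens], sd = s, log.p = TRUE)         # P(ystar <= 0)
  -(sum(ll_obs) + sum(ll_cens))
}

ols   <- lm(y ~ x)                       # biased benchmark, used as starting point
start <- c(coef(ols), log(sd(residuals(ols))))
opt   <- optim(start, negll, method = "BFGS")

est <- c(intercept = opt$par[[1]], slope = opt$par[[2]], sigma = exp(opt$par[[3]]))
rbind(tobit_ml = est[1:2], ols = unname(coef(ols)))
```

OLS on the censored outcome attenuates the slope towards zero, whereas the maximum likelihood estimates are close to the true values of 1.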

Selection models (type 2 tobit or heckit)

In this section we look at an endogenous selection process. The outcome y is observed only if d is equal to one, where d is a binary variable which is correlated with the error term of y.

  • heckit() and selection() in sampleSelection [2]. The command is called heckit() in honor of James Heckman[3].
N <- 1000
u <- rnorm(N)
v <- rnorm(N)
x <- - 1 + rnorm(N)
z <- 1 + rnorm(N)
d <- (1 + x + z + u + v> 0)
ystar <- 1 + x + u
y <- ystar*(d == 1)
hist(y)

ols <- lm(y ~ x)
summary(ols)

library(sampleSelection)
heckit.ml <- heckit(selection = d ~ x + z, outcome = y ~ x, method = "ml")
summary(heckit.ml)

heckit.2step <- heckit(selection = d ~ x + z, outcome = y ~ x, method = "2step")
summary(heckit.2step)
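
The two-step estimator can be sketched by hand in base R: a probit for selection, the inverse Mills ratio computed from its linear index, and OLS on the selected sample with that ratio as an additional regressor. This only illustrates the point estimates; the second-step standard errors are not corrected here, which heckit() does. The larger sample size and the seed are arbitrary choices, and the outcome is set to NA (rather than 0) when it is not observed.

```r
set.seed(1)                                   # arbitrary seed
N <- 5000
u <- rnorm(N)
v <- rnorm(N)
x <- -1 + rnorm(N)
z <- 1 + rnorm(N)                             # affects selection only
d <- as.integer(1 + x + z + u + v > 0)        # selection indicator
y <- ifelse(d == 1, 1 + x + u, NA)            # outcome observed only if d == 1

# Step 1: probit for selection and inverse Mills ratio
sel <- glm(d ~ x + z, family = binomial(link = "probit"))
idx <- predict(sel)                           # estimated linear index
imr <- dnorm(idx) / pnorm(idx)                # inverse Mills ratio

# Step 2: OLS on the selected sample, adding the inverse Mills ratio
step2 <- lm(y ~ x + imr, subset = d == 1)
naive <- lm(y ~ x, subset = d == 1)           # ignores selection

rbind(naive = coef(naive), twostep = coef(step2)[1:2])
```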

Multi-index selection models

In this section we look at endogenous selection processes in matching markets. Matching is concerned with who transacts with whom, and how. For example, which students attend which college. The outcome y is observed only for equilibrium student-college pairs (or matches). These matches are indicated with d equal to one with d a binary variable which is correlated with the error term of y.

  • stabit() and stabit2() in matchingMarkets.[4][5] The command is called stabit() in reference to the application in stable matching markets.

Simulate two-sided matching data for 20 markets (m=20) with 100 students (nStudents=100) per market and 20 colleges with a quota of 5 students each (nSlots=rep(5,20)). True parameters in the selection and outcome equations are all equal to 1.

library(matchingMarkets)
xdata <- stabsim2(m=20, nStudents=100, nSlots=rep(5,20),
  colleges = "c1",
  students = "s1",
  outcome = ~ c1:s1 + eta + nu,
  selection = ~ -1 + c1:s1 + eta
)

Observe the bias from sorting between students and colleges.

lm1 <- lm(y ~ c1:s1, data=xdata$OUT)
summary(lm1)

Correct for sorting bias by running the Gibbs sampler in Sorensen (2007).[6]

fit2 <- stabit2(OUT = xdata$OUT,
           colleges = "c1",
           students = "s1",
           outcome = y ~ c1:s1, 
           selection = ~ -1 + c1:s1,
           niter=1000
)
summary(fit2)

Truncation

  • truncreg package
  • DTDA "An R package for analyzing truncated data" pdf.

References

  1. Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL http://CRAN.R-project.org/package=AER
  2. Sample Selection Models in R: Package sampleSelection http://www.jstatsoft.org/v27/i07
  3. James Heckman "Sample selection bias as a specification error", Econometrica: Journal of the econometric society, 1979
  4. Klein, T. (2015). "Analysis of Stable Matchings in R: Package matchingMarkets" (PDF). Vignette to R Package matchingMarkets.
  5. "matchingMarkets: Analysis of Stable Matchings". R Project.
  6. Sorensen, M. (2007). "How Smart is Smart Money? A Two-Sided Matching Model of Venture Capital". Journal of Finance. 62 (6): 2725–2762.


Count Data Models


The Poisson model

Fake data simulations

We assume that y follows a Poisson distribution with mean exp(1 + 1 * x). We store the data in the df data frame.

N <- 1000
x <- rnorm(N)
alpha <- c(1,1)
y <- rpois(N,exp(alpha[1] + alpha[2] * x))
df <- data.frame(x,y)
plot(x,y)


Maximum likelihood

We estimate this simple model using the glm() function with family = poisson as option.

fit <- glm(y ~ x, family = poisson, data = df)
summary(fit)
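
Because the model uses a log link, the coefficients act multiplicatively on the expected count. A short sketch of how to read the fitted object (the seed is arbitrary):

```r
set.seed(1)                              # arbitrary seed
N <- 1000
x <- rnorm(N)
y <- rpois(N, exp(1 + x))
fit <- glm(y ~ x, family = poisson)

# exp(coefficient) is the factor by which the expected count is
# multiplied when x increases by one unit
exp(coef(fit))

# Fitted means are the exponential of the linear predictor
mu_hat <- predict(fit, type = "response")
head(cbind(mu_hat, exp(predict(fit))))

# With an intercept and the log link, fitted and observed counts have the same total
c(observed = sum(y), fitted = sum(mu_hat))
```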

Bayesian estimation

The model can also be estimated using Bayesian methods with the MCMCpoisson() function, which is provided in the MCMCpack package.

library("MCMCpack")
posterior <- MCMCpoisson(y ~ x, data = df)
plot(posterior)
summary(posterior)

Overdispersion test

  • dispersiontest() (AER package) provides a test for equidispersion.
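
An informal check that needs no extra package is the Pearson dispersion statistic, which should be close to 1 under equidispersion. The sketch below is not the procedure implemented by dispersiontest(); the negative binomial data, the size parameter and the seed are assumptions used to produce an overdispersed comparison case.

```r
set.seed(1)                                          # arbitrary seed
N <- 1000
x <- rnorm(N)
y_pois <- rpois(N, exp(1 + x))                       # equidispersed counts
y_nb   <- rnbinom(N, mu = exp(1 + x), size = 1)      # overdispersed counts

# Pearson chi-squared statistic divided by the residual degrees of freedom
dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

d_pois <- dispersion(glm(y_pois ~ x, family = poisson))
d_nb   <- dispersion(glm(y_nb ~ x, family = poisson))

c(poisson = d_pois, negative_binomial = d_nb)
```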

Zero inflated model

See the zic package[1]

Bivariate Poisson regression

  • bivpois package for bivariate Poisson regression.

References

  1. Markus Jochmann (2010). zic: Bayesian Inference for Zero-Inflated Count Models. R package version 0.5-3. http://CRAN.R-project.org/package=zic
  2. Cameron, A.C. and Trivedi, P.K. (1998). Regression Analysis of Count Data. Cambridge: Cambridge University Press.
  3. Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL http://CRAN.R-project.org/package=AER


Duration Analysis

  • Using R for Survival Analysis (pdf)
  • See the survival package
  • bootkm() Bootstrap Kaplan-Meier Estimates in Hmisc package
  • event.chart() Flexible Event Chart for Time-to-Event Data in the Hmisc package


References




Time Series

Introduction

In the following examples we will use the data set Mpyr, which is included in the R package Ecdat. It can be loaded and viewed in R with the following code.

# Install the package Ecdat.
install.packages("Ecdat")
# Load the package Ecdat.
library(Ecdat)
# Load the dataset Mpyr.
data(Mpyr)
# Show the dataset Mpyr.
Mpyr
Time Series:
Start = 1900 
End = 1989 
Frequency = 1 
            m        p         y         r
1900 1.718774 2.092641 0.9030195  4.380000
1901 1.856318 2.086574 1.0131038  4.280000
1902 1.936512 2.120476 1.0114817  4.920000

Creating time-series objects

  • The function ts() is used to create time-series objects.
  • The function as.ts() coerces an object to a time-series.
  • The function is.ts() tests whether an object is a time-series.

Example:

> data.a<-seq(1,24,by=1)
> is.ts(data.a)
[1] FALSE
> ts(data.a, start=c(2005,1), frequency=12) 
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2005   1   2   3   4   5   6   7   8   9  10  11  12
2006  13  14  15  16  17  18  19  20  21  22  23  24
> data.b<-seq(1,24,by=1)
> is.ts(data.b)
[1] FALSE
> is.ts(as.ts(data.b))
[1] TRUE

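Creating lagged and differenced variables

  • The function lag() shifts the time base of a series by k periods. With a positive k the series is moved back in time, so lag(x, k = 1) places each value one period earlier; use a negative k (for example k = -1) to align each observation with the previous one.
  • The function diff() creates a differenced variable.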

Example:

> data.a<-seq(1,12,by=1)
> ts.a<-ts(data.a, start=c(2005,1), frequency=4)
> lag.a<-lag(ts.a,k=1)
> diff.a<-diff(ts.a,lag=1,differences=1)
> ts.a
     Qtr1 Qtr2 Qtr3 Qtr4
2005    1    2    3    4
2006    5    6    7    8
2007    9   10   11   12
> lag.a
     Qtr1 Qtr2 Qtr3 Qtr4
2004                   1
2005    2    3    4    5
2006    6    7    8    9
2007   10   11   12     
> diff.a
     Qtr1 Qtr2 Qtr3 Qtr4
2005         1    1    1
2006    1    1    1    1
2007    1    1    1    1
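
To line up each observation with the value from the previous period, a negative k can be used. The sketch below assumes a quarterly series like the one above; the name prev.a is introduced here for illustration. The stats:: prefix is used because some packages (dplyr, for example) define their own lag() function.

```r
ts.a <- ts(1:12, start = c(2005, 1), frequency = 4)

# A negative k moves the series forward in time, so the value observed
# one quarter earlier lines up with the current quarter.
prev.a <- stats::lag(ts.a, k = -1)

# Keep only the quarters where both series are observed
aligned <- ts.intersect(ts.a, prev.a)
aligned
```

The first column of aligned holds the current value and the second column the value from the previous quarter.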


Plotting time-series objects

  • The function plot.ts() plots time-series objects. It is also the method called by plot() when its argument is a time series.

Fit Autoregressive Models to Time-series by OLS

To fit an autoregressive model to a time series by ordinary least squares, use the function ar.ols(), which is part of the "stats" package.
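
A minimal sketch of ar.ols() on simulated data follows. The AR(1) coefficient of 0.6, the sample size and the seed are arbitrary assumptions chosen so that the estimate can be compared with a known value.

```r
set.seed(1)
# Simulate 2000 observations from an AR(1) process with coefficient 0.6
x <- arima.sim(model = list(ar = 0.6), n = 2000)

# Fit an AR(1) by OLS; aic = FALSE forces the order given in order.max
fit <- ar.ols(x, order.max = 1, aic = FALSE, demean = TRUE, intercept = FALSE)

fit$order        # order of the fitted model
drop(fit$ar)     # estimated autoregressive coefficient
```

With the default aic = TRUE, ar.ols() instead selects the order by AIC, up to order.max.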

Autocorrelation function

The function acf() computes (and by default plots) estimates of the autocovariance or autocorrelation function. The function pacf() computes the partial autocorrelations. The function ccf() computes the cross-correlation or cross-covariance of two univariate series.[1]
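
The three functions can be tried on simulated data as sketched below. The series x and y, the seed and the two-period delay are assumptions made for illustration; plot = FALSE returns the estimates without drawing them. According to the help page of ccf(), the value at lag k estimates the correlation between x[t+k] and y[t].

```r
set.seed(2)
n <- 500
x <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))

# y is x delayed by two periods, plus a little noise
y <- c(0, 0, x[1:(n - 2)]) + rnorm(n, sd = 0.1)

a <- acf(x, lag.max = 10, plot = FALSE)    # autocorrelations, lag 0 comes first
p <- pacf(x, lag.max = 10, plot = FALSE)   # partial autocorrelations, from lag 1
cc <- ccf(x, y, lag.max = 10, plot = FALSE)

a$acf[1:3]                       # lags 0, 1 and 2
p$acf[1:3]                       # lags 1, 2 and 3
peak <- cc$lag[which.max(cc$acf)]
peak                             # lag with the largest cross-correlation
```

For an AR(1) series the autocorrelations decay gradually, while the partial autocorrelations should be close to zero after lag 1.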

Useful R-packages

  • fBasics, tis, zoo, tseries, xts, urca, forecast

References

http://cran.r-project.org/web/views/TimeSeries.html

http://cran.r-project.org/doc/contrib/Ricci-refcard-ts.pdf


Factor Analysis


Introduction

Factor analysis is a set of techniques for reducing the dimensionality of data. The goal is to describe the data set with a smaller number of variables (i.e. underlying factors). Factor analysis was developed in the early part of the 20th century by L.L. Thurstone and others. Correspondence analysis was originally developed by Jean-Paul Benzécri in the 1960s and 1970s. Factor analysis is mainly used in marketing, sociology and psychology. It is often treated as part of multivariate data analysis or exploratory data analysis.

There are three main methods. Principal Component Analysis (PCA) deals with continuous variables. Correspondence Analysis deals with a contingency table (two qualitative variables), and Multiple Correspondence Analysis generalizes correspondence analysis to more than two qualitative variables. The major difference between Factor Analysis (FA) and PCA is that FA analyses only the variance that is common to multiple variables, while PCA analyses all of the variance. Factor Analysis is difficult to use properly and is often misapplied in the psychological literature. One of the major issues in FA (and PCA) is the number of factors to extract from the data. Extracting the wrong number of factors can make the results difficult to interpret and analyse.

A number of techniques can be applied to assess how many factors to extract. The two most useful are parallel analysis and the Minimum Average Partial (MAP) criterion. Parallel analysis simulates random data sets with the same dimensions as the data and extracts eigenvalues from each simulated data set. Factors are retained as long as the eigenvalues of the data exceed the simulated ones; the first point at which a simulated eigenvalue is greater than that of the data marks the end of the "correct" number of factors. The Minimum Average Partial criterion uses a different approach but can often be more accurate. Simulation studies have established these two methods as the most accurate. Both methods are available in the psych package through the fa.parallel() and VSS() functions.
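
The idea behind parallel analysis can be sketched in base R as follows. This is a simplified version based on the eigenvalues of the correlation matrix, not the implementation used by fa.parallel(); the number of simulations (n_sim), the 95% quantile and the seed are illustrative assumptions. The simulated data have two underlying factors by construction.

```r
set.seed(10)
N <- 1000
factor1 <- rnorm(N)
factor2 <- rnorm(N)
mydat <- data.frame(x1 = rnorm(N) + factor1, x2 = rnorm(N) + factor1,
                    x3 = rnorm(N) + factor2, x4 = rnorm(N) + factor2)

# Eigenvalues of the observed correlation matrix
obs_eig <- eigen(cor(mydat), symmetric = TRUE, only.values = TRUE)$values

# Eigenvalues from random normal data of the same dimensions
n_sim <- 200   # number of simulated data sets (illustrative choice)
sim_eig <- replicate(n_sim, {
  noise <- matrix(rnorm(N * ncol(mydat)), nrow = N)
  eigen(cor(noise), symmetric = TRUE, only.values = TRUE)$values
})
threshold <- apply(sim_eig, 1, quantile, probs = 0.95)

# Retain components until an observed eigenvalue drops below the threshold
n_retain <- sum(cumprod(obs_eig > threshold))
n_retain
```

Comparing obs_eig with threshold side by side shows where the observed eigenvalues stop exceeding what random data would produce.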

Another issue in factor analysis is which rotation (if any) to choose. Essentially, rotations transform the loadings so that the factors are more easily interpretable. There are two major classes of rotations, orthogonal and oblique. Orthogonal rotations assume that the factors are uncorrelated, while oblique rotations allow the factors to correlate (but do not force them to). Oblique rotations are recommended by some (e.g. MacCallum et al., 1999), since an orthogonal solution can be obtained from an oblique rotation, but not vice versa.
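
The two classes of rotation can be compared with factanal() from the stats package, which offers "varimax" (orthogonal) and "promax" (oblique). The sketch below uses simulated data with two correlated factors; the variable names, the seed and the strength of the correlation are assumptions made for illustration.

```r
set.seed(20)
N <- 1000
f1 <- rnorm(N)
f2 <- 0.5 * f1 + rnorm(N)          # the two factors are correlated by design
dat <- data.frame(
  x1 = f1 + rnorm(N), x2 = f1 + rnorm(N), x3 = f1 + rnorm(N),
  x4 = f2 + rnorm(N), x5 = f2 + rnorm(N), x6 = f2 + rnorm(N)
)

fa_orth <- factanal(dat, factors = 2, rotation = "varimax")  # orthogonal
fa_obl  <- factanal(dat, factors = 2, rotation = "promax")   # oblique

# Small loadings are hidden to make the pattern easier to read
print(loadings(fa_orth), cutoff = 0.3)
print(loadings(fa_obl),  cutoff = 0.3)
```

The rotation changes the loadings but not the fit: the uniquenesses are the same in both solutions.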

One of the issues surrounding factor analysis is that an infinite number of rotations explain the same amount of variance, so it can be difficult to assess which model is correct. In response to such concerns, Confirmatory Factor Analysis (CFA), a special case of Structural Equation Modelling (SEM), was developed by Jöreskog in the 1970s. The essential principle of SEM is that, given a model, it attempts to reproduce the covariance matrix observed in the data. How well a model reproduces the data can be used as a test of that model. SEM is implemented in R in the sem, lavaan and OpenMx packages.

See the following packages: FactoMineR, amap, ade4, anacor, vegan, psych.

Principal Component Analysis (PCA)

PCA deals with continuous variables.

  • prcomp() in the stats package.
  • princomp() in the stats package.
  • PCA() in the FactoMineR package.
  • See also factanal() in the stats package.
  • See also fa() and principal() in the psych package.
N <- 1000
factor1 <- rnorm(N)
factor2 <- rnorm(N) 
x1 <- rnorm(N) + factor1
x2 <- rnorm(N) + factor1
x3 <- rnorm(N) + factor2 
x4 <- rnorm(N) + factor2
mydat <- data.frame(x1,x2,x3,x4)
pca <- prcomp(mydat)
names(pca)
plot(pca) # plot the eigenvalues
biplot(pca) # A two dimensional plot

pca2 <- princomp(mydat)
biplot(pca2)

pca2 <- princomp(~ x1 + x2 + x3 + x4, data = mydat) # princomp with a formula syntax
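
The object returned by prcomp() can also be inspected numerically. The following sketch regenerates data of the same form as above with an arbitrary seed and computes the share of variance carried by each component; the name var_explained is introduced here for illustration.

```r
set.seed(30)
N <- 1000
factor1 <- rnorm(N)
factor2 <- rnorm(N)
mydat <- data.frame(x1 = rnorm(N) + factor1, x2 = rnorm(N) + factor1,
                    x3 = rnorm(N) + factor2, x4 = rnorm(N) + factor2)

pca <- prcomp(mydat, scale. = TRUE)   # scale. = TRUE standardises the variables

# Proportion of the total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)
cumsum(var_explained)

pca$rotation      # loadings: one column per component
head(pca$x)       # scores of the first observations
```

Since the data are built from two factors, the first two components should account for most of the variance.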


Correspondence Analysis (CA)

Correspondence analysis is a tool for analyzing contingency tables.

  • corresp() MASS
  • Michael Greenacre's ca package (JSS article)
  • Correspondence Analysis and Related Network (link)
  • Quick-R's page (link)
  • Simple and Canonical Correspondence Analysis Using the R Package anacor (pdf, JSS article)
  • multiv

Multiple Correspondence Analysis (MCA)



Ordination

Overview

This page provides basic code for creating a distance matrix and running and plotting a Non-metric Multidimensional Scaling (NMDS) ordination.

Read more about Ordination on Wikipedia.

This code relies on package vegan in R by Jari Oksanen.

Data

First, import data and load required libraries:

require(MASS)
require(vegan)
data(varespec)   # species data
data(varechem)   # environmental data

Distance matrix

bray <- vegdist(varespec, method = "bray")   # calculate a distance matrix

# vegdist() offers many distance measures,
# listed on its help page ('?vegdist'). Common distance measures include:
#   'bray' = Bray-Curtis
#   'canberra' = Canberra
#   'euclidean' = Euclidean

Unconstrained Ordination

Displaying dissimilarity using NMDS

NMDS analysis and plotting:

nmds <- metaMDS(varespec, k = 2,
                distance = 'bray', autotransform = FALSE)   # semi-black box NMDS function

ordiplot(nmds, type = "text")           # plot the NMDS ordination
fit <- envfit(nmds, varechem[ ,1:4])    # calculate environmental vectors
fit                                     # list vector endpoint coordinates and r-squared values
plot(fit)                               # add the environmental vectors to the plot
# a linear representation of environmental variables is not always appropriate
# we could also add a smooth surface of the variable to the plot
ordisurf(nmds, varechem$N, add = TRUE, col = "darkgreen")
nmds$stress                             # stress value

Figure: resulting NMDS plot


In the metaMDS function, k is the user-defined number of dimensions of the ordination. Stress measures how poorly the distances in the k-dimensional configuration reproduce the original dissimilarities. Conventional wisdom seems to suggest that stress should not exceed 10-12%. Stress is reduced by increasing the number of dimensions. However, increasing dimensionality might decrease the "realism" of a 2-dimensional plot of the first two NMDS axes.
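
The relationship between stress and the number of dimensions can be illustrated without vegan. The sketch below swaps in isoMDS() from the MASS package (Kruskal's non-metric MDS) as a stand-in for metaMDS() and applies it to eurodist, a distance data set that ships with R; both substitutions are assumptions made so that the example is self-contained. isoMDS() reports stress in percent.

```r
library(MASS)   # provides isoMDS(); used here as a stand-in for metaMDS()

# eurodist ships with R: road distances between 21 European cities
stress_by_k <- sapply(2:4, function(k) {
  isoMDS(eurodist, k = k, trace = FALSE)$stress   # stress in percent
})
names(stress_by_k) <- paste0("k=", 2:4)
stress_by_k
```

The same loop could be written around metaMDS() by collecting the stress component for each value of k.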


We can also run an NMDS with 3 dimensions, fit environmental vectors and create a dynamic graph:

nmds3d <- metaMDS(varespec, k = 3,
                  distance = 'bray', autotransform = FALSE)   # run NMDS with 3 dimensions
nmds3d$stress                                            # stress drops
fit3d <- envfit(nmds3d, varechem[ ,1:4], choices = 1:3)  # fit environmental vectors to 3d space
# in recent versions of vegan, ordirgl() is provided by the separate vegan3d package
ordirgl(nmds3d, envfit = fit3d)                          # dynamic 3D graph

Running a principal component analysis (PCA) on environmental data

chem_pca <- rda(varechem, scale = TRUE)    # Run PCA
biplot(chem_pca, scaling = 2)              # display biplot

Figure: PCA biplot

Constrained Ordination

Clustering

Basic clustering

K-Means Clustering

You can use the kmeans() function.

First create some data (100 observations on 10 variables):

> dat <- matrix(rnorm(1000), nrow=100, ncol=10)

To apply kmeans(), you need to specify the number of clusters (the cluster sizes shown below depend on the random data and on the random starting points):

> cl <- kmeans(dat, 3) # here 3 is the number of clusters
> table(cl$cluster)
 1  2  3 
38 44 18
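
Because kmeans() starts from randomly chosen centres, results can vary between runs. The sketch below uses a seed and the nstart argument to make the result more stable, and shows a few components of the returned object. The two well-separated groups are simulated purely for illustration.

```r
set.seed(100)
# Two well-separated groups of 50 points each in two dimensions
dat <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))

# nstart = 10 tries several random starts and keeps the best solution
cl <- kmeans(dat, centers = 2, nstart = 10)

table(cl$cluster)   # cluster sizes
cl$centers          # estimated cluster centres
cl$tot.withinss     # total within-cluster sum of squares
```

With real data the number of clusters is rarely known in advance, so it is common to compare tot.withinss across several values of centers.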

Hierarchical Clustering

The basic hierarchical clustering function is hclust(), which works on a dissimilarity structure as produced by the dist() function:

> hc <- hclust(dist(dat)) # data matrix from the example above
> plot(hc)

The resulting tree can be cut using the cutree() function.

Cutting it at a given height:

> cl <- cutree(hc, h=5.1)
> table(cl)
cl
 1  2  3  4  5 
23 33 29  4 11

Cutting it to obtain given number of clusters:

> cl <- cutree(hc, k=5)
> table(cl)
cl
 1  2  3  4  5 
23 33 29  4 11
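
The two ways of cutting the tree are related: cutting at a height h returns the clusters that exist at that level of the dendrogram, while k asks directly for a number of clusters. A small sketch with simulated, well-separated data (an assumption made for illustration) shows both giving the same partition.

```r
set.seed(101)
dat <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 20), ncol = 2))

hc <- hclust(dist(dat))            # complete linkage is the default

by_k <- cutree(hc, k = 2)          # ask for two clusters
by_h <- cutree(hc, h = 15)         # cut the tree at height 15

table(by_k, by_h)                  # cross-tabulate the two partitions
range(hc$height)                   # merge heights, useful for choosing h
```

Inspecting hc$height or the plotted dendrogram helps in choosing a sensible value of h.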

Available alternatives



Network Analysis

Introduction

We mainly use the following packages to demonstrate network analysis in R: statnet, sna and igraph. This is not a complete list; see the CRAN Task View gR (gRaphical Models in R) for a fuller overview.

Creating simple graphs with igraph

> # load the appropriate library
> library(igraph)
> # now create a few simple graphs
> # (igraph 1.0 and later also provide these constructors under the names
> # make_empty_graph(), make_ring() and make_full_graph())
> # an undirected graph with 10 nodes and without any edge
> g1 <- graph.empty(10,directed=FALSE)
> # a directed ring graph with 10 nodes
> g2 <- graph.ring(10,directed=TRUE)
> # a complete undirected graph with 10 nodes
> g3 <- graph.full(10,directed=FALSE)
> # now get information about these graphs
> # g1 is an igraph object, U = Undirected, with 10 nodes and 0 edges
> summary(g1)
IGRAPH U--- 10 0 -- 
> # g2 is an igraph object, D = Directed, with 10 nodes and 10 edges
> summary(g2)
IGRAPH D--- 10 10 -- Ring graph

Creating graphs from data

First load the igraph package

library(igraph)

then you can choose your preferred format. Below are examples of data provided as edge list and as adjacency matrix.

Creating graph from an edge list

An edge list is formed by a two-column matrix, with each row defining one edge. An edge is drawn from each element in the first column to the corresponding element in the second one. Use the graph.edgelist() function to import your data.

 
# producing some random data in edge list form
el <- cbind(sample(1:10, 10), sample(1:10, 10))

# creating and plotting the graph from the edge list
gr <- graph.edgelist(el)
plot(gr)

Creating graph from an adjacency matrix

An adjacency matrix is an n × n matrix describing a graph with n vertices, where each entry aij gives the number of edges from vertex i to vertex j. To import your adjacency matrix, use the graph.adjacency() function.

 
# producing a random adjacency matrix
adj <- matrix(sample(0:1, 100, replace = TRUE), 10, 10)

# creating and plotting the graph from the adjacency matrix
gr <- graph.adjacency(adj)
plot(gr)
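
The two input formats carry the same information. The following base R sketch converts a small edge list into an adjacency matrix and back; it does not use igraph, and the example graph and the names el_back and n are assumptions chosen for illustration.

```r
# A small directed graph on 4 vertices, given as an edge list
el <- cbind(from = c(1, 1, 2, 3, 4), to = c(2, 3, 3, 4, 1))
n <- 4

# Edge list -> adjacency matrix: count the edges from i to j
adj <- matrix(0, nrow = n, ncol = n)
for (r in seq_len(nrow(el))) {
  adj[el[r, "from"], el[r, "to"]] <- adj[el[r, "from"], el[r, "to"]] + 1
}
adj

# Adjacency matrix -> edge list: positions of the non-zero entries
# (this simple version lists each pair once, ignoring repeated edges)
el_back <- which(adj > 0, arr.ind = TRUE)
colnames(el_back) <- c("from", "to")
el_back

rowSums(adj)   # out-degree of each vertex
colSums(adj)   # in-degree of each vertex
```

Edge lists are compact for sparse graphs, while adjacency matrices make degree calculations and matrix operations straightforward.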



Profiling R code

Before starting with parallel or high-performance computing, it is important to analyze and optimize the R code itself. R provides some useful tools for profiling R code. A good, short introduction is given in the "Writing R Extensions" manual.
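
As a minimal sketch of these tools, the example below times two ways of building the same vector with system.time() and then profiles the slower one with Rprof() and summaryRprof(). The function names grow_vector and preallocate and the vector length are hypothetical choices made for illustration.

```r
# Two ways of building the same vector (function names are illustrative)
grow_vector <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) out <- c(out, i^2)   # reallocates at every step
  out
}
preallocate <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2        # fills a vector of fixed size
  out
}

n <- 30000
system.time(a <- grow_vector(n))
system.time(b <- preallocate(n))

# Sampling profiler: record where the time is spent, then summarise
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)
invisible(grow_vector(n))
Rprof(NULL)
prof <- summaryRprof(prof_file)
head(prof$by.total)
```

Rprof() samples the call stack at regular intervals, so very short computations may produce too few samples to be informative.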


Parallel computing with R

There are many packages and tools available for parallel computing with R. A good overview is provided by the CRAN Task View: High-Performance and Parallel Computing with R and several publications:

  • State of the Art in Parallel Computing with R; Markus Schmidberger, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney, Ulrich Mansmann; Journal of Statistical Software 2009: JSS

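
A minimal sketch using the parallel package, which ships with R, is given below. It relies on mclapply(), which forks the R process and is therefore limited to one core on Windows; the function slow_square is a hypothetical stand-in for an expensive computation, and the choice of two cores is an assumption.

```r
library(parallel)

# A deliberately simple task; slow_square is a hypothetical stand-in
# for an expensive computation
slow_square <- function(i) {
  Sys.sleep(0.05)
  i^2
}

# Forking is not available on Windows, so fall back to one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

serial_res   <- lapply(1:8, slow_square)
parallel_res <- mclapply(1:8, slow_square, mc.cores = n_cores)

identical(serial_res, parallel_res)   # compare the serial and parallel results
detectCores()                         # number of cores R can see
```

On Windows, or across several machines, a socket cluster created with makeCluster() and used through parLapply() is the usual alternative within the same package.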


Sources

For the following resources, authors have explicitly given the permission to include their material on the R programming wikibook. Remember that even if they have given their permission, they should be correctly cited.

Blogs

  • R-statistics (the R category) (A link to a post which provides proper licence for approving this content for use).
  • GETTING GENETICS DONE - R tag. The R content is available from here: http://gettinggeneticsdone.blogspot.com/search/label/R. The R code is copyrighted under The open source BSD license (as is described here: http://gettinggeneticsdone.blogspot.com/p/copyright.html). The content itself is licensed under a Creative Commons Attribution-Share-Alike 3.0 Unported License (as is shown at the bottom of every post). Bottom line - the R code and written content can be used freely (with attribution).
  • Struggling Through Problems: http://strugglingthroughproblems.blogspot.com/search/label/R
  • Backsidesmack R-stuff category. Copyright information is in the footer and explicit permission is in this post
  • Al3xandr3: http://al3xandr3.github.com/tags/r.html
  • Cloudnumbers.com (the R category): Posts about high-performance computing and cloud computing with R. A link to a post which provides proper license for approving this content for use.
  • The R Tutorial Series (http://rtutorialseries.blogspot.com) by John M. Quick provides a collection of user-friendly guides to researchers, students, and others who want to learn how to use R for their statistical analyses. Its content is available for use in the R Programming wikibook under a Creative Commons BY-SA License.
  • Exploring Indian census data using R and useful scripts to download weather related data from websites. The content is available for use in the R wikibook under cc-sa license.
  • Plain Data Analysis tips at www.danielmarcelino.com. The topics covered in the blog relate to the social sciences, but there is a great variety of them.
  • R Tutorial [5]
  • R Workshop [6]

Handouts

Index

This page provides tables which make it easy to find functions for common statistical tasks in R, SAS and Stata. Other software, such as SPSS, may be included in the future.

Data management

Function | R | Stata | SAS
Merge | merge() | merge / mmerge | -
Reshape | reshape() | reshape | -
Expand a dataset | expand() (epicalc) | expand | -
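
The R entries for merging and reshaping can be tried on small data frames. The data below are invented for illustration; the column names id, x, y and score are assumptions.

```r
# Two small illustrative data frames sharing the key column "id"
a <- data.frame(id = 1:3, x = c(10, 20, 30))
b <- data.frame(id = 2:4, y = c("b", "c", "d"))

inner <- merge(a, b, by = "id")               # only ids present in both
outer <- merge(a, b, by = "id", all = TRUE)   # all ids, NA where missing

# Wide to long: one row per id and time point
wide <- data.frame(id = 1:2, score.1 = c(5, 6), score.2 = c(7, 8))
long <- reshape(wide, direction = "long",
                varying = c("score.1", "score.2"), idvar = "id")
long
```

reshape() guesses the variable name and the time index from column names of the form name.time, which is why the wide columns are called score.1 and score.2.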

Descriptive Statistics

Function | R | Stata | SAS
Mean | mean() | mean | proc means
Histogram | hist() | hist | -
Frequency table | table() | ta | proc freq

Regression models

Function | R | Stata | SAS
Least squares | lm() | reg | proc reg
Generalized linear model (GLM) | glm() | glm | proc genmod
Probit models | glm(, family = binomial(link="probit")) | probit | -
Logit models | glm(, family = binomial(link="logit")) | logit | -
Linear fixed effects model | plm( , model = "within") (plm) | xtreg , fe | -
Linear random effects model | plm( , model = "random") (plm) | xtreg , re | -
Linear quantile regression | rq() (quantreg) | qreg | -
Ordinal logistic regression | polr() (MASS) | ologit | -
Linear IV (2sls) | ivreg() (AER) | ivreg | proc syslin (2sls)
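
As an illustration of the probit and logit entries, the sketch below fits both models with glm() to simulated binary data. The data generating process, the seed and the object names are assumptions made for this example.

```r
set.seed(7)
n <- 2000
x <- rnorm(n)
# Latent-variable data generating process with standard normal errors,
# so the probit model is the correctly specified one here
y <- as.integer(0.5 + x + rnorm(n) > 0)

probit_fit <- glm(y ~ x, family = binomial(link = "probit"))
logit_fit  <- glm(y ~ x, family = binomial(link = "logit"))

coef(probit_fit)   # should be close to the true values (0.5, 1)
coef(logit_fit)    # on a different scale, hence larger in magnitude

# Predicted probabilities from the two links are usually very similar
max_diff <- max(abs(fitted(probit_fit) - fitted(logit_fit)))
max_diff
```

The coefficients of the two models are not directly comparable, but the fitted probabilities usually are.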

Programming

Function | R | Stata | SAS
Check some condition | stopifnot() | assert | -
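
A short sketch of stopifnot() used for input checking follows. The helper safe_log is hypothetical and exists only for illustration.

```r
# safe_log is a hypothetical helper used only for illustration
safe_log <- function(x) {
  stopifnot(is.numeric(x), all(x > 0))   # every condition must be TRUE
  log(x)
}

safe_log(c(1, 10))

# A failed check raises an error, which can be caught with tryCatch()
res <- tryCatch(safe_log(-1), error = function(e) conditionMessage(e))
res
```

Each argument of stopifnot() is checked in turn, and the error message names the first condition that is not TRUE.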