R Programming

The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/R_Programming

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-ShareAlike 3.0 License.

Introduction

What is R ?

R is statistical software used for data analysis. It includes a huge number of statistical procedures such as t-tests, chi-square tests, standard linear models, instrumental variables estimation, local polynomial regressions, etc. It also provides high-level graphics capabilities. R shares a few syntactic similarities with the C programming language, but the two work quite differently: R is an interpreted language, whereas C is compiled.

Why use R?

R is free software. R is an official GNU project and is distributed under the GNU General Public License (GPL).
R is a powerful data-analysis package with many standard and cutting-edge statistical functions. See the Comprehensive R Archive Network (CRAN)'s Task Views to get an idea of what you can do with R.
R is a programming language, so its abilities can easily be extended through the use of user-defined functions. A large collection of user-contributed functions and packages can be found in CRAN's Contributed Packages.
R is widely used in political science, statistics, econometrics, actuarial sciences, sociology, finance, etc.
R is available for all major operating systems (Windows, Mac OS, GNU/Linux).
R is object-oriented. Virtually anything (e.g., complex data structures) can be stored as an R object.
R is a matrix language.
R syntax is much more systematic than Stata or SAS syntax.
R can be installed on your USB stick^[1].

Alternatives to R

S-PLUS is a commercial version of the same S programming language that R is a free version of.
Gretl is free software for econometrics. It has a graphical user interface and is nice for beginners.
SPSS is proprietary software which is often used in sociology, psychology and marketing. It is known to be easy to use.
GNU PSPP is a free-software alternative to SPSS.
SAS is proprietary software that can be used with very large datasets such as census data.
Stata is proprietary software that is often used in economics and epidemiology.
Julia is a general programming language with capabilities similar to MATLAB, R and Python. It aims for speed close to that of C and can call libraries from all of those languages.
MATLAB is proprietary software used widely in the mathematical sciences and engineering.
Octave is free software similar to MATLAB. The syntax is largely compatible, so most MATLAB code can be used in Octave.
Python is a general programming language. It includes some specific libraries for data analysis such as Pandas^[2] ·^[3].

Beginners can have a look at GNU PSPP or Gretl. Intermediate users can check out Stata. Advanced users who like matrix programming may prefer MATLAB or Octave. Very advanced users may use C or Fortran.

R programming style

R is an object oriented programming language. This means that virtually everything can be stored as an R object. Each object has a class. This class describes what the object contains and what each function does with it. For instance, plot(x) produces different outputs depending on whether x is a regression object or a vector.
The assignment symbol is "<-". Alternatively, the classical "=" symbol can be used.

The two following statements are equivalent:

 > a <- 2
 > a = 2

Arguments are passed to functions inside round brackets (parentheses).
One can easily combine functions. For instance you can directly type

mean(rnorm(1000)^2)

The symbol "#" comments to the end of the line:

 # This is a comment
 5 + 7 # This is also a comment

Commands are normally separated by a newline. If you want to put more than one statement on a line, you can use the ";" delimiter.

 a <- 1:10 ; mean(a)

You can also have one statement on multiple lines.
R is case sensitive: a and A are two different objects.
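Names may contain letters, digits, dots "." and underscores "_", but a name cannot begin with an underscore or a digit. Traditionally dots were preferred to underscores in names.

You can also use the native pipe operator |> (available since R 4.1.0), which passes the value on its left as the first argument of the function call on its right. The following is equivalent to mean(1:10):

 1:10 |> mean()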
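
As a minimal sketch of the points in this section (the object names x, fit, a and A are illustrative only, and cars is a dataset shipped with R), the same generic function behaves differently depending on the class of its argument, and names differing only in case refer to different objects:

```r
# Illustrative objects; the names are arbitrary
x <- c(1, 2, 3, 4)
fit <- lm(dist ~ speed, data = cars)  # 'cars' is a dataset shipped with R

class(x)    # "numeric"
class(fit)  # "lm"

# summary() is a generic function: it dispatches on the class of its argument
s_x <- summary(x)      # descriptive statistics of a numeric vector
s_fit <- summary(fit)  # regression table for a fitted linear model

# R is case sensitive: a and A are two different objects
a <- 1
A <- 2
a == A  # FALSE
```

The same mechanism explains why plot(x) and plot(fit) would produce different graphics.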

How you can help

Here are some things editors do to keep this book internally consistent. If you have something to contribute, go ahead and make your contribution. Other editors can touch up your edits afterwards so that they conform to the guidelines.

The local manual of style WB:LMOS for the R programming book, including a brief explanation of why we do it that way, is:

Examples use "source" tags : <syntaxhighlight lang="rsplus"> a <- 1:10 ; mean(a) </syntaxhighlight>. That makes them look pretty to our readers.
Package names are in bold: '''Hmisc'''.
Function names are in "code" tags: <code>lm()</code>.
Page titles -- the part after "R Programming/" -- are in sentence case, like "R Programming/Working with data frames". We couldn't decide between sentence case and title case, so I flipped a coin.
Every page has <noinclude>{{R Programming/Navigation}}</noinclude> at the top and {{R Programming/Navbar|Mathematics|Probability Distributions}} at the bottom. That makes it easier to navigate from one page to another online.

References

↑ Portable R by Andrew Redd http://sourceforge.net/projects/rportable/
↑ "Python Data Analysis Library". pandas.pydata.org/. Retrieved February 14, 2013.
↑ "Getting started with Pandas". blog.kaggle.com. January 17, 2013. Retrieved February 14, 2013.


Sample Session

This page is an introduction to the R programming language. It shows how to perform very simple tasks using R. First you need to have R installed (see the Settings page). If you use Windows or Mac OS, the easiest solution is to use the R Graphical User Interface (click on its icon). If you use Linux, open a terminal and type R at the command prompt.

Usually when you open R, you see a message similar to the following in the console:

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Workspace loaded from ~/.RData]

>

You can type your code after the angle bracket >.

R can be used as a simple calculator and we can perform any simple computation.

 
> # Sample Session 
> # This is a comment
> 
> 2 # print a number
[1] 2
> 2+3 # perform a simple calculation
[1] 5
> log(2) # natural log
[1] 0.6931472

We can also store numeric or string objects using the assignment operator, <-.

> x <- 2 # store an object
> x # print this object
[1] 2
> (x <- 3) # store and print an object
[1] 3
> 
> x <- "Hello" # store a string object
> x
[1] "Hello"

We can also store vectors.

> Height <- c(168, 177, 177, 177, 178, 172, 165, 171, 178, 170) #store a vector
> Height  # print the vector
 [1] 168 177 177 177 178 172 165 171 178 170
> 
> Height[2] # Print the second component
[1] 177
> Height[2:5] # Print the second, the 3rd, the 4th and 5th component
[1] 177 177 177 178
> 
> (obs <- 1:10) # Define a vector as a sequence (1 to 10)
 [1]  1  2  3  4  5  6  7  8  9 10
> 
> Weight <- c(88, 72, 85, 52, 71, 69, 61, 61, 51, 75)
> 
> BMI <- Weight/((Height/100)^2)   # Performs a simple calculation using vectors
> BMI
 [1] 31.17914 22.98190 27.13141 16.59804 22.40879 23.32342 22.40588 20.86112
 [9] 16.09645 25.95156

We can also describe the vector with length(), mean() and var().

> length(Height)
[1] 10
> mean(Height) # Compute the sample mean
[1] 173.3
> var(Height)
[1] 22.23333

We can also define a matrix.

> M <- cbind(obs,Height,Weight,BMI) # Create a matrix
> typeof(M) # Give the type of the matrix
[1] "double"
> class(M)  # Give the class of an object (R 4.0.0 and later also report "array")
[1] "matrix"
> is.matrix(M) # Check if   M is a matrix
[1] TRUE
> is.vector(M)  # M is not a vector
[1] FALSE
> dim(M)    # Dimensions of a matrix
[1] 10  4
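
Once a matrix exists, its rows, columns and single cells can be extracted with square brackets. The following is a small sketch on a shortened version of the data (only three observations, chosen for brevity):

```r
Height <- c(168, 177, 177)
Weight <- c(88, 72, 85)
M <- cbind(Height, Weight)   # 3 x 2 matrix with column names

M[1, ]          # first row
M[, "Weight"]   # column selected by name
M[2, 1]         # single element: row 2, column 1
tM <- t(M)      # transpose: 2 rows, 3 columns
dim(tM)
```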

We can plot the data using plot().

 
> plot(Height,Weight,ylab="Weight",xlab="Height",main="Corpulence")

We can define a dataframe.

 
> mydat <- data.frame(M) # Creates a dataframe
> names(mydat) # Give the names of each variable
[1] "obs"    "Height" "Weight" "BMI"   
> str(mydat)   # give the structure of your data
'data.frame':   10 obs. of  4 variables:
 $ obs   : num  1 2 3 4 5 6 7 8 9 10
 $ Height: num  168 177 177 177 178 172 165 171 178 170
 $ Weight: num  88 72 85 52 71 69 61 61 51 75
 $ BMI   : num  31.2 23 27.1 16.6 22.4 ...
> 
> View(mydat)  # Look at your data
> 
> summary(mydat)  # Descriptive Statistics
      obs            Height          Weight           BMI       
 Min.   : 1.00   Min.   :165.0   Min.   :51.00   Min.   :16.10  
 1st Qu.: 3.25   1st Qu.:170.2   1st Qu.:61.00   1st Qu.:21.25  
 Median : 5.50   Median :174.5   Median :70.00   Median :22.70  
 Mean   : 5.50   Mean   :173.3   Mean   :68.50   Mean   :22.89  
 3rd Qu.: 7.75   3rd Qu.:177.0   3rd Qu.:74.25   3rd Qu.:25.29  
 Max.   :10.00   Max.   :178.0   Max.   :88.00   Max.   :31.18  
>
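
The columns of a dataframe can be read with $ and its rows filtered with a logical condition. This is a minimal sketch using a shortened, hypothetical version of the data above:

```r
mydat <- data.frame(
  Height = c(168, 177, 177),
  Weight = c(88, 72, 85)
)

mydat$Height                          # one column, returned as a vector
heavy <- mydat[mydat$Weight > 80, ]   # rows satisfying a condition
mydat$BMI <- mydat$Weight / (mydat$Height / 100)^2   # add a new column
nrow(mydat)                           # number of rows
ncol(mydat)                           # number of columns
```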

You can save an R session (all the objects in memory) and load the session.

> save.image(file="~/Documents/Logiciels/R/test.rda")
> load("~/Documents/Logiciels/R/test.rda")

We can define a working directory. Note for Windows users: R uses the forward slash ("/") in paths instead of the backslash ("\"); a backslash must be doubled ("\\") inside a character string.

> setwd("~/Desktop")            # Sets working directory (character string enclosed in "...")
> getwd()                       # Returns current working directory
[1] "/Users/username/Desktop"
> dir()                         # Lists the content of the working directory
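
A script that changes the working directory can restore it afterwards, and file.path() builds paths with forward slashes on every platform. The sketch below uses tempdir() so that it runs anywhere; the path "data/input.csv" is hypothetical and is never opened:

```r
old_wd <- getwd()                     # remember the current directory
setwd(tempdir())                      # tempdir() exists for the whole session
p <- file.path("data", "input.csv")   # hypothetical relative path
p                                     # "data/input.csv"
normalizePath(".", winslash = "/")    # absolute path of the working directory
setwd(old_wd)                         # restore the original directory
```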

There are some special values in R:

NA: Not Available (i.e., missing values)
NaN: Not a Number (e.g., 0/0)
Inf: Infinity
-Inf: Minus Infinity

For instance, 0 divided by 0 gives NaN, but 1 divided by 0 gives Inf (positive infinity).

 > 0/0
 [1] NaN
 > 1/0
 [1] Inf
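
These special values can be detected with is.na(), is.nan() and is.finite(). A short sketch on an arbitrary vector:

```r
v <- c(1, NA, 3, 0/0, 1/0)

is.na(v)       # TRUE for both NA and NaN
is.nan(v)      # TRUE only for NaN
is.finite(v)   # FALSE for NA, NaN, Inf and -Inf

mean(c(1, NA, 3))                # NA: missing values propagate
mean(c(1, NA, 3), na.rm = TRUE)  # 2: missing values are dropped first
```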

We can exit R using q(). The "no" argument specifies that the workspace is not saved.

q("no")

Manage your workspace

This page explains how to manage your workspace.

Basic functions

ls() lists the objects in your workspace.
list.files() lists the files located in the working directory.
rm() removes objects from your workspace; rm(list = ls()) removes them all.

rm(list=ls()) # remove all the objects in the workspace

Each object can be saved to the disk using the save() function. They can then be loaded into memory using load().

# assume you want to save an object called 'df'
save(df, file = "file.Rda")
...
load("file.Rda")

save.image() saves your workspace.
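
A complete save and load round trip might look like the sketch below. The object name my_data and the use of a temporary directory are assumptions made so that the example runs anywhere:

```r
my_data <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))  # example object
path <- file.path(tempdir(), "file.Rda")                   # temporary location

save(my_data, file = path)   # write the object to disk
rm(my_data)                  # remove it from the workspace
exists("my_data")            # FALSE

loaded <- load(path)         # restores the object under its original name
loaded                       # "my_data": load() returns the restored names
```

Note that load() restores objects under the names they had when saved, silently overwriting objects of the same name.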

Information about the session

sessionInfo() gives information about your session, i.e., loaded packages, R version, etc.
R.version provides information about the R version.
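
These objects can also be queried from within a script, for example to check the running version. In this sketch the threshold "3.5.0" is only an example value:

```r
R.version.string            # one-line description of the running version
R.version$major             # major version number, as a character string
getRversion() >= "3.5.0"    # version comparison; the threshold is an example

info <- sessionInfo()
names(info$otherPkgs)       # attached non-base packages (NULL if none)
```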

Memory usage

Note: memory.size() and memory.limit() are Windows-specific; on Linux and Mac (checked with R version 3.5.1) they are not supported. Since R 4.2.0 they are no longer supported on Windows either.

memory.size() gives the total amount of memory currently used by R.

> memory.size()
[1] 10.18

memory.limit() without any argument gives the current memory limit for R. It can also be used to increase the limit. The maximum amount is bounded by the memory of the computer.

> memory.limit()
[1] 1535
>  memory.limit(size=2000) # 2000 stands for 2000 MB
[1] 2000

object.size() returns the size of an R object. You can print the result and choose the unit (bytes, kilobytes, megabytes, etc.).

> a <- rnorm(10^7)
> object.size(a)
80000024 bytes
> print(object.size(a),units="b")
80000024 bytes
> print(object.size(a),units="Kb")
78125 Kb
> print(object.size(a),units="Mb")
76.3 Mb
> print(object.size(a),units="Gb")
0.1 Gb
> print(object.size(a),units="auto")
76.3 Mb

memory.profile() returns more details.

> memory.profile()
       NULL      symbol    pairlist     closure environment     promise 
          1        4959       61794        1684         255        3808 
   language     special     builtin        char     logical     integer 
      14253          46         687        5577        2889        4060 
     double     complex   character         ...         any        list 
        523           1       11503           0           0        1024 
 expression    bytecode externalptr     weakref         raw          S4 
          1           0         497         117         118         642

gc() initiates the garbage collector which causes R to free memory from objects no longer used.

> gc()
           used (Mb) gc trigger  (Mb) max used (Mb)
Ncells  1095165 58.5    1770749  94.6  1770749 94.6
Vcells 12060564 92.1   17769683 135.6 12062095 92.1
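
A typical pattern is to remove a large object and then call the garbage collector. The following sketch uses a vector whose size is chosen arbitrarily:

```r
a <- rnorm(1e6)                 # one million doubles, roughly 8 MB
size_a <- object.size(a)
format(size_a, units = "Mb")    # human-readable size

rm(a)                           # drop the only reference to the vector
invisible(gc())                 # trigger a collection, output suppressed
exists("a")                     # FALSE: the object is gone
```

R normally runs the garbage collector on its own, so an explicit gc() call is mostly useful for inspecting memory usage.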

External links

Dumping functions from the global environment into an R script file


Settings

This page shows how to install R, customize it and choose a working environment. Once you have installed R, you may want to choose a working environment. This can be a simple text editor (such as Emacs, Vim or Gedit), an integrated development environment (IDE) or a graphical user interface (GUI). RStudio is now a popular option.

Installation

Linux

Installing R on Debian-based GNU/Linux distributions (e.g. Ubuntu or Debian itself) is as simple as typing sudo aptitude install r-base or sudo apt-get install r-base (this has to be done with root privileges, hence sudo), or installing the package r-base using your favourite package manager, for example Synaptic.

There are also many packages extending R for different purposes. Their names begin with r-. Take a closer look at the package r-recommended. It is a metapackage that depends on a set of packages that are recommended by the upstream R core team as part of a complete R distribution. It is possible to install R by installing just this package, as it depends on r-base.

Installation with apt-get (Debian, Ubuntu and all Linux distributions based on Debian)

sudo apt-get install r-base
sudo apt-get install r-recommended

Installation with aptitude (Debian, Ubuntu and all Linux distributions based on Debian)

sudo aptitude install r-base
sudo aptitude install r-recommended

Mac OS

Installation: Visit the R project website (http://r-project.org/), select the "CRAN" page and choose a mirror. Download the disk image (dmg file) and install R.

The default graphical user interface for Mac is much better than the one for Windows. It includes

a dataframe manager,
a history of all commands,
a program editor which supports syntax highlighting.

Windows

(Section source ^[1])

Download

To install R under the Windows operating system, you have to download the binaries from the web. First go to r-project.org, click CRAN under the download section on the left panel and select a mirror site from which to download the required content. It is best to pick a mirror close to your actual geographical location, but other ones should work as well. Then click Windows and, in the subdirectories, base. The Windows binary is the exe file, of the form R-x.x.x-win32.exe, where x.x.x denotes the actual version of the program. Regardless of the version, the setup has the same steps.

Setup

As usual in Windows, if you just keep clicking the Next button, you will install the program without any problems. However, there are a few things that you can alter.

On the welcome screen click Next.
Read or just notice the GNU license, and click Next.
Select the location where R should be installed. If you have no preference for a particular location on your hard disk, the default choice will be fine.
During the next step you can specify which parts of R you want to install. Choices are: User installation, Minimal user installation, Full installation and Custom installation. Notice the required space under the selection panel (varies between 20 and 66 MB). If you are a beginner in R, choose the default User installation.
In this step you can choose whether to customize the startup options. If you accept the defaults, you skip the 3 "extra" steps during installation (see below).
You can specify the Start menu folder.
In the next step you can choose between shortcut possibilities (desktop icon and/or quick launch icon) and specify registry entries.

With these steps you can customize the R graphical user interface.

You can choose whether you want an R graphical user interface covering the whole screen (MDI) or a smaller window (SDI).
You can select the style in which the Help screen is displayed in R. You will use help a lot, so this may be an important decision. It is up to you which style you prefer. Please note that the content of the help files will be the same regardless of your choice. Here you specify just the appearance of that particular window.
In the next step you can specify whether you want to use internet2.dll. If you are a beginner, pick the Standard option here.

Update

Updating R on Windows requires several steps:

Downloading/installing the latest version of R
Copying your packages from the library folder to the one in the new R installation

Both of these steps can easily be done using the installr package, by running the following command (which would both install the package, and update R) ^[2]:

# install the package if needed, then load it
if (!require(installr)) {
  install.packages("installr")
  require(installr)
}
updateR() # updates R

There is also the possibility of using a "global" package library, see here for more details.

Portable R for Windows

A portable version is available if you want to install R on your USB stick^[3]. This is useful if you don't have admin rights on a computer. The basic installation requires about 115 MB, but you may need more if you want to install add-on packages.

Working environment

Once you have installed R, you need to choose a working environment. In this section, we review the possible working environments. These include a basic terminal as well as integrated development environments (hereafter IDE), text editors and graphical user interfaces (hereafter GUI).

A graphical user interface provides menus which make it possible to run R without writing code. This is a good solution for beginners.
A text editor makes it easy to write code.
An integrated development environment provides a text editor and an R console, which makes it easy to write R scripts, to run them and to correct them.

Note that there are some task specific GUIs. For instance speedR provides a GUI to import data into R.

Terminal

For Linux and Mac OS users it is possible to use R from the terminal.

$ R
> q("no") # to leave R and return to the terminal

R Gui

For Mac OS and Windows users, there is a graphical user interface. In Mac OS, the GUI includes a package manager, a program editor with syntax highlighting and a data browser. In Windows, the GUI offers little more than a terminal.

Graphical User Interface

This section includes material for beginners (e.g., people who are not familiar with computing).

Poor Man's GUI (pmg)

A simple GUI for learning R. It is recommended for beginners.

> install.packages("pmg", dependencies=TRUE)
# Windows users may also run the following scripts to install required libraries
> source("http://www.math.csi.cuny.edu/pmg/installpmg.R")

> library(pmg)

Jaguar : Java GUI for R

Jaguar : Java GUI for R^[4] is available for Linux, Mac and Windows (screenshots).
It is good for beginners.

R commander

Rcommander^[5] developed by John Fox provides a menu in the standard Graphical User Interface (screenshots).
It works on Linux, Mac and Windows.
It is a good interface for beginners and for people who are not used to script editing.

> install.packages("Rcmdr") # installation
> library("Rcmdr") # usage

Ubuntu users can also install R Commander from the software center.

Integrated development environment

RStudio

RStudio is an integrated development environment for R^[6].

It works on Mac, Windows and Linux platforms.
It supports Sweave and LaTeX.
It includes syntax highlighting for R, LaTeX and Sweave.
It includes a way to view variables and dataframes.
It makes it easy to load and install packages, to navigate in the help files and to manage your workspace.
It supports code and file name completion.
It can be installed on a USB stick.

John Verzani has written a book dedicated to this new interface, Getting Started with RStudio^[7], and Jeffrey Racine recommends RStudio for Sweave^[8].

RKWard

RKWard is an IDE and a GUI for Linux (KDE) (Screenshots). RKWard aims to provide an easily extensible, easy-to-use IDE/GUI for R. RKWard tries to combine the power of the R language with the (relative) ease of use of commercial statistics tools.

Eclipse with StatET

Eclipse with the StatET plugin^[9] provides an IDE for R.

It supports Sweave.

Rattle GUI

Rattle^[10] is a data mining GUI for R, available for Linux, Windows and Mac (screenshots)^[11].

Tinn R

For Windows only.
Tinn R^[12] is a good IDE for Windows users. One can easily define keyboard shortcuts to execute selected R code from Tinn R.

Notepad++ and NpptoR

For Windows only.

Notepad++^[13] and NPPtoR^[14] provide syntax highlighting and hotkeys (by default F8) to send lines of code to R. Syntax highlighting can be easily modified using the dialog box to manage user-defined languages (Menu/View/User Define Dialog...). NPPtoR provides a method to generate syntax highlighting dynamically (depending on all the available packages in the R environment).

Vi, Vim and GVim

See also the Wikibook Learning the vi Editor.

Vim and GVim provide syntax highlighting.
Vim is for advanced users only.
The Vim-R-plugin allows communication between Vim and R.

Emacs and ESS

WinEdt

How to use R for Windows with the RWinEdt extension? by Andy Eggers^[17].
WinEdt is not open source.
WinEdt is for Windows only.
Install the RWinEdt package.

gedit with gedit-r-plugin

For Linux users only.
There is also a plugin for gedit called gedit-r-plugin. This can be installed using Synaptic or any other package manager on a Linux platform.

Customizing R

R profile

R can be customized using the Rprofile file. On Linux, this file is stored in the home directory. You can edit it by running the following command in a terminal:

$ gedit ~/.Rprofile

If you use some packages very often, you can load them systematically using the Rprofile file. You can also change the default options.
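
As an illustration, the contents of ~/.Rprofile might look like the sketch below. Every setting shown is only an example (the digits value, the mirror and the package loaded are arbitrary choices), not a recommendation:

```r
# Possible contents of ~/.Rprofile (all values are examples)
options(digits = 4)                                        # print fewer digits
options(repos = c(CRAN = "https://cloud.r-project.org"))   # default CRAN mirror

# .First() is run at the start of each session
.First <- function() {
  library(tools)   # example of a package loaded systematically
  message("Profile loaded on ", format(Sys.Date()))
}
```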

Options

The function options() without any argument shows all options:

> options()

The linguistic and encoding options can be modified using Sys.setlocale() :

> Sys.setlocale()
[1] "fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/en_US.UTF-8"

By default, error messages are in the local language. However, it is possible to set them in English using Sys.setenv():

Sys.setenv(LANGUAGE='en')
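
A single option can be read with getOption() and changed with options(), which returns the previous values so that they can be restored later. A short sketch using the digits option as an example:

```r
old <- options(digits = 3)   # set a new value and keep the previous one
print(pi)                    # 3.14
getOption("digits")          # 3
options(old)                 # restore the earlier setting
```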

References

↑ This section was imported from the Wikiversity project Installation, How to use R course
↑ Updating R from R (on Windows) – using the {installr} package
↑ Portable R http://sourceforge.net/projects/rportable/
↑ http://jgr.markushelbig.org/JGR.html
↑ http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/
↑ rstudio.org
↑ John Verzani "Getting Started with RStudio An Integrated Development Environment for R", O'Reilly Media, September 2011
↑ Jeffrey Racine, (forthcoming), "RStudio: A Platform Independent IDE for R and Sweave," Journal of Applied Econometrics.
↑ StatET : http://www.walware.de/goto/statet
↑ Rattle : http://rattle.togaware.com/
↑ Graham J Williams. Rattle: A Data Mining GUI for R. The R Journal, 1(2):45-55, December 2009
↑ Tinn stands for Tinn Is Not Notepad http://www.sciviews.org/Tinn-R/
↑ Note that Notepad++ can be installed on a USB stick http://sourceforge.net/projects/notepadpluspe/
↑ NPPtoR is also a portable software http://sourceforge.net/projects/npptor/
↑ ESS : http://ess.r-project.org/
↑ Vincent Goulet Emacs page http://vgoulet.act.ulaval.ca/emacs
↑ http://www.people.fas.harvard.edu/~aeggers/RWinEdt_installation.pdf


Documentation

Obtaining Help

For each package you have a reference manual available as an HTML file from within R or as a PDF on the CRAN website. You also often have Vignettes or comprehensive articles in the R Journal, the Journal of Statistical Software, etc.

library(help="package_name")
vignette("np",package="np")
vignette(all=FALSE) # vignettes for all attached packages
vignette(all=TRUE) # vignettes for all packages on the computer

You can search for help inside all loaded packages using help() or ?. Usually you do not need to add quotes to function names, but sometimes it can be useful. args() gives the full syntax of a function.

help(lm)
?lm
?"for"
?"[["
args("lm")
function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...) 
# NULL

apropos() and find() look for all the functions in the loaded packages containing a keyword or a regular expression^[1].

apropos("norm")
#   [1] "dlnorm"         "dnorm"          "plnorm"        
#   [4] "pnorm"          "qlnorm"         "qnorm"         
#   [7] "qqnorm"         "qqnorm.default" "rlnorm"        
#  [10] "rnorm"          "normalizePath"
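
Because the keyword is treated as a regular expression, the search can be anchored to the beginning or the end of a name. A brief sketch (the patterns are arbitrary examples):

```r
apropos("^read\\.")   # names starting with "read."
apropos("norm$")      # names ending in "norm"
find("lm")            # where lm is defined on the search path
```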

You can search for help in all installed packages using help.search() or its shortcut ??.

??"lm"
help.search("covariance")

RSiteSearch() looks for help in all packages and in the R mailing lists. The sos package improves the RSiteSearch() function with the findFn() function. ??? is a wrapper for findFn().

RSiteSearch("spline")
library("sos")
findFn("spline", maxPages = 2)
???"spline"(2)

hints() in the hints package suggests what to do with an object.

fit <- lm(y ~ x)
library("hints")
hints(fit) # returns a list of functions that work with lm objects

Handouts

An Introduction to R The R Reference Manual
Robert Kabacoff's Quick R
Grant Farnsworth's Econometrics in R The best introduction for an economist (about 20 pages)
UCLA R Computing Resources
A Handbook of Statistical Analyses Using R by Brian S. Everitt and Torsten Hothorn
fr+en Arthur Charpentier's R for actuaries
Dan Goldstein's video tutorial
fr Julien Barnier's introduction to R for sociologists
Rosetta Code presents solutions to the same task in different programming languages.
'R language for programmers', by John Cook
A Brief Guide to R Beginners in Econometrics
R Tutorial by Kelly Black

Teaching Resources

François Briatte has a nice introduction to data analysis using R^[2]
Simon Jackman Political Methodology Classes
Jonathan Katz Political Methodology Classes
A Brief Guide to R for Beginners in Econometrics
PRISM luncheons
Statistical Analysis: an Introduction using R - which includes a course on R
Biostatistics with R aka A R companion to Wayne Daniel's Biostatistics Book

Blogs

Planet R the first R blog aggregator
R Bloggers The news pulse for the R blogosphere
"R" you Ready ?
One R Tip a Day
Revolution computing blog
Yu Sung Su's Blog:R
(fr) Freakonometrics (in French) lots of code chunks
(fr) Baptiste Coulmont (in French)
(fr) Quanti Sciences Sociales (in French) R blog for sociologists

Journals

The R Journal
Journal of Statistical Software contains lots of articles on R packages.
The Political Methodologist contains lots of articles on R for political scientists.

Books

Venables and Ripley : Modern Applied Statistics with S
- A very good introduction to R covering numerous topics.
A Handbook of Statistical Analyses Using R (Brian S. Everitt and Torsten Hothorn, Chapman & Hall/CRC, 2008)
An Introduction to Data Technologies, by Paul Murrell
- Everything you need to know about data management
A first course in statistical programming with R, John Braun and Duncan Murdoch.
Peter Dalgaard (2009). ISwR: Introductory Statistics with R. R package version 2.0-4. http://CRAN.R-project.org/package=ISwR
Springer Use R Series
John Fox : An R and S-PLUS Companion to Applied Regression
Gelman and Hill: Data Analysis Using Regression and Multilevel/Hierarchical Models

useR and other R conferences

useR! 2009
useR! 2010
London R homepage
R / Finance conferences in 2009 and 2010

Search Engine

R seek
Google Code Search with the keyword "lang:r" gives access to R programs matching the request. For instance, the request optim lang:r gives access to all the R programs including optim.

Q&A / Forums

Nabble R http://r.789695.n4.nabble.com/
Stackoverflow
The #rstats hashtag on Twitter
IRC: #r@freenode
r-soc: mailing list for French sociologists

References

↑ If you want to know more about regular expressions, have a look at the Regular expressions section in the Text Processing page.
↑ Introduction to Data Analysis


Control Structures

Conditional execution

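Help for programming:

> ?Control

if accepts a single condition, i.e., a logical value of length one.

> if (condition){
+     statement  
+     } else {
+     alternative
+     }

At the console, else must appear on the same line as the closing brace of the if block; otherwise R treats the if statement as complete and reports a syntax error when it reaches else.

The condition may be TRUE or FALSE, T or F, 1 or 0, or an expression using the comparison operators: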

x == y "x is equal to y"
x != y "x is not equal to y"
x > y "x is greater than y"
x < y "x is less than y"
x <= y "x is less than or equal to y"
x >= y "x is greater than or equal to y"

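These can be combined using the & or && operators for AND, and | or || for OR. & and | work element by element on vectors, whereas && and || expect single values and are the ones normally used inside if().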

> if(TRUE){
+     print("This is true")
+     }
  [1] "This is true"
> x <- 2  # x gets the value 2
> if(x==3){
+     print("This is true")
+     } else {
+     print("This is false")
+     }
 [1] "This is false"
> y <- 4 # y gets the value 4
> if(x==2 && y>2){
+     print("x equals 2 and y is greater than 2")
+     }
 [1] "x equals 2 and y is greater than 2"
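
The difference between the element-wise and the single-value operators can be sketched as follows (the vectors are arbitrary examples):

```r
x <- c(1, 5, 10)
x > 2 & x < 8          # element-wise: FALSE TRUE FALSE
any(x > 2 & x < 8)     # TRUE: collapsed to one value, usable in if()
all(x > 0)             # TRUE

y <- NULL
# && evaluates its right-hand side only if the left-hand side is TRUE,
# so the comparison y > 0 is never attempted on a NULL object
msg <- if (!is.null(y) && y > 0) "positive" else "missing or not positive"
msg                    # "missing or not positive"
```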

The ifelse() command takes as first argument the condition, as second argument the value returned if the condition is true and as third argument the value returned if the condition is false. Here the condition can be a vector. For instance, we generate a sequence from 1 to 10 and we want to keep the values which are lower than 5 or greater than 8, replacing the others with 0.

> x <- 1:10 
> ifelse(x<5 | x>8, x, 0)
 [1]  1  2  3  4  0  0  0  0  9 10
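
Calls to ifelse() can be nested to recode a vector into more than two categories. In this sketch the labels "low", "mid" and "high" are arbitrary:

```r
x <- 1:10
size <- ifelse(x < 5, "low", ifelse(x > 8, "high", "mid"))
size          # one label per element of x
table(size)   # count of each label
```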

Sets

R has some very useful handlers for sets to select a subset of a vector:

> x = runif(10)
> x<.5
 [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
> x
 [1] 0.32664759 0.57826623 0.98171138 0.01718607 0.24564238 0.62190808 0.74839301 
 [8] 0.32957783 0.19302650 0.06013694
> x[x<.5]
[1] 0.32664759 0.01718607 0.24564238 0.32957783 0.19302650 0.06013694

To exclude a subset of a vector:

> x = 1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> x[-1:-5]
[1]  6  7  8  9 10
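
Base R also provides the usual set operations on vectors. The following sketch uses two arbitrary vectors a and b:

```r
a <- c(1, 2, 3, 4, 5)
b <- c(4, 5, 6, 7)

union(a, b)        # 1 2 3 4 5 6 7
intersect(a, b)    # 4 5
setdiff(a, b)      # 1 2 3
3 %in% a           # TRUE
a[a %in% b]        # 4 5: elements of a that are also in b

x <- 1:10
x[-(1:5)]          # 6 7 8 9 10: same result as x[-1:-5]
```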

Loops

Implicit loops

R has support for implicit loops, which is called vectorization. This is built into many functions and standard operators. For example, the + operator can add two arrays of numbers without the need for an explicit loop.

Explicit loops are generally slow in R, and it is better to replace them with implicit loops when possible.
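
To make the idea concrete, here is a minimal sketch that adds two arbitrary vectors first with an explicit loop and then with the vectorized + operator:

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

# Explicit loop: fill the result element by element
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# Implicit loop: the + operator works on whole vectors
z_vec <- x + y

identical(z_loop, z_vec)   # TRUE
```

Both versions give the same result, but the vectorized form is shorter and the looping happens in compiled code.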

apply() can apply a function to elements of a matrix or an array. This may be the rows of a matrix (1) or the columns (2).
lapply() applies a function to each element of a list (for instance each column of a dataframe) and returns a list.
sapply() is similar but the output is simplified. It may be a vector or a matrix depending on the function.
tapply() applies the function for each level of a factor.

> N <- 10
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1 + 1
> male <- rbinom(N,1,.48)
> y <- 1 + x1 + x2 + male + rnorm(N)
> mydat <- data.frame(y,x1,x2,male)
> lapply(mydat,mean) # returns a list
$y
[1] 3.247

$x1
[1] 0.1415

$x2
[1] 1.29

$male
[1] 0.5

> sapply(mydat,mean) # returns a vector
     y     x1     x2   male 
3.2468 0.1415 1.2900 0.5000 
> apply(mydat,1,mean) # applies the function to each row
 [1]  1.1654  2.8347 -0.9728  0.6512 -0.0696  3.9206 -0.2492  3.1060  2.0478  0.5116
> apply(mydat,2,mean) # applies the function to each column
     y     x1     x2   male 
3.2468 0.1415 1.2900 0.5000 
> tapply(mydat$y,mydat$male,mean) # applies the function to each level of the factor
    0     1 
1.040 5.454

See also aggregate() which is similar to tapply() but is applied to a dataframe instead of a vector.
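As a minimal sketch of aggregate(), the example below computes the mean of two variables within each group. The dataframe and its column names are invented for illustration.

```r
# Small made-up dataframe; the column names are chosen for illustration only
dat <- data.frame(
  y     = c(1, 2, 3, 4, 5, 6),
  x     = c(10, 20, 30, 40, 50, 60),
  group = c("a", "a", "b", "b", "b", "a")
)

# Mean of y and x within each level of group
res <- aggregate(cbind(y, x) ~ group, data = dat, FUN = mean)
res
#   group y  x
# 1     a 3 30
# 2     b 4 40
```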

Explicit loops

R provides three ways to write loops: for, repeat and while. The for statement is very simple. You define an index (here k) and a vector (in the example below the vector is 1:5), and you specify the action you want between braces.

> for (k in 1:5){
+ print(k)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

When it is not possible to use the for statement, you can also use repeat or while by specifying a breaking rule. One should be careful with this kind of loop since the loop will never end if the breaking rule is misspecified. In the two examples below, values are drawn from the standard normal distribution as long as the value drawn is lower than 1. The cat() function is used to display the current value on screen.

> repeat { 
+ 	g <- rnorm(1) 
+ 	if (g > 1.0) break 
+ 	cat(g,"\n")
+ 	} 
-1.214395 
0.6393124 
0.05505484 
-1.217408 
> g <- 0
> while (g < 1){
+ 	g <- rnorm(1) 
+ 	cat(g,"\n")
+ 	}
-0.08111594 
0.1732847 
-0.2428368 
0.3359238 
-0.2080000 
0.05458533 
0.2627001 
1.009195
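One way to guard against a loop that never ends is to add a cap on the number of iterations. The sketch below assumes a hypothetical limit max_iter; the seed is arbitrary and only makes the run reproducible.

```r
set.seed(1)        # arbitrary seed so the sketch is reproducible
max_iter <- 1000   # hypothetical safety cap on the number of draws
n_draws <- 0

repeat {
  g <- rnorm(1)
  n_draws <- n_draws + 1
  # Stop on the intended condition or when the cap is reached
  if (g > 1.0 || n_draws >= max_iter) break
}

n_draws   # number of draws actually made
```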

The next statement can be used to discontinue one particular cycle and skip to the next one.

> for (k in 1:10) { 
+   if(k==8) {
+     print("skipped")
+     next
+   }
+   print(k)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] "skipped"
[1] 9
[1] 10

Iterators

Loops in R are generally slow. Iterators may be more efficient than loops. See the entry on this topic in the Revolution Computing blog.


Working with functions

Looking at the code of a function

You can type the name of the function in the console without any round brackets after the name. This will print the code of the function in the console.
You can also use the page() function which opens a new editor window and prints the code of the function in this editor.
You can also use the trCopy() function in the TinnR package to copy the code of the function. Then you just have to paste it in a text editor to have a look at it.

Here is an example with the lm() function.

> lm
> page(lm)
> library(TinnR)
> trCopy(lm)
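Base R also lets you inspect parts of a function separately. The sketch below uses formals(), body() and args() on a small function, greet(), which is defined here only for illustration.

```r
# Toy function defined only for this example
greet <- function(name, greeting = "hello") {
  paste(greeting, name)
}

names(formals(greet))   # argument names
# [1] "name"     "greeting"
body(greet)             # the body as an unevaluated expression
args(lm)                # prints the argument list of lm() without its body
```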

Creating your own function

A simple function without argument which doesn't return anything

> fn <- function(){
+ print("hello")
+ }
> fn()
[1] "hello"

Returning an object

By default the value of the last line (*) is returned. In the following example, we have a simple function with two objects. The last one is returned.

> test <- function() {
+ x <-1
+ z <- 2
+ }
> res <- test()
> res
[1] 2

The function can return an object explicitly using return() (but as it is the last line, you could simply use x instead):

> test <- function() {
+ x <- 1
+ z <- 2
+ return(x)
+ }
> res <- test()
> res
[1] 1

(*) More precisely, it is not the "last line" but rather the value of the last evaluation which is returned from the function.

Adding arguments

It is possible to add arguments.

square <- function(x){
	x2 <- x^2
	return(x2)
	}
square(x = 2)

Note that the above function could be written more concisely as

square <- function(x) x^2

(as the last value is returned)
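Arguments can also have default values, and they can be passed by name in any order. The function power_of() below is a hypothetical variation on square(), written as a sketch.

```r
# power is a hypothetical extra argument with a default value of 2
power_of <- function(x, power = 2) {
  x^power
}

power_of(3)                    # uses the default: 3^2
# [1] 9
power_of(3, power = 3)         # overrides the default: 3^3
# [1] 27
power_of(power = 2, x = 1:4)   # named arguments can be given in any order
# [1]  1  4  9 16
```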

The ... argument means that you can add other arguments which will be passed to functions inside the function.

plot2 <- function(x,...){
	plot(x, type = "l", ...)
	}
plot2(runif(100), main = "line plot", col = "red")

It is possible to add a dataframe as argument^[1]. Here is an example :

redplot <- function(z, y, data, env=parent.frame()) {
       if(!missing(data)){
			z <- data[,deparse(substitute(z))]
			y <- data[,deparse(substitute(y))]
			}
	plot(z,y, col = "red", pch = 15)
} 

mydat <- data.frame(vm = rnorm(10),output = rnorm(10))
redplot(vm,output,data=mydat)

For estimation commands it is possible to add formulas as arguments. For instance, we can create our own function for ordinary least squares using a formula interface.

ols <- function(formula, data = list()) {
	mf <- model.frame(formula=formula, data=data)
	X <- model.matrix(attr(mf, "terms"), data=mf)
	y <- model.response(mf)
	beta <- solve(t(X)%*%X)%*%t(X)%*%y
	se <- sqrt( 1/(nrow(X) - ncol(X)) * sum((y - X%*%beta)^2) * diag(solve(t(X)%*%X)))
	res <- cbind(beta,se)
	colnames(res) <- c("Coefficients","Standard errors")
	res
}
N <- 100
u <- rnorm(N)
x <- rnorm(N) + 1
y <- 1 + x + u
ols(y~x)

Recursive functions

R supports recursive functions. The function below calls itself to build the vector of the first n Fibonacci numbers.

> fib <- function(n) {
              if(n > 2) {
                   m <- fib(n-1)
                   c(m, sum(tail(m, 2)))
                   }
              else rep(1, n)
              }
> fib(30)
 [1]      1      1      2      3      5      8     13     21     34     55
[11]     89    144    233    377    610    987   1597   2584   4181   6765
[21]  10946  17711  28657  46368  75025 121393 196418 317811 514229 832040
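A simpler recursion, shown here as a sketch, is the factorial. The function fact() is written for illustration only, since base R already provides factorial().

```r
# Recursive factorial, written for illustration
# (base R already provides factorial())
fact <- function(n) {
  if (n <= 1) {
    1                  # base case stops the recursion
  } else {
    n * fact(n - 1)    # recursive case
  }
}

fact(5)
# [1] 120
```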

Functions as Objects

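R functions can be treated as objects: a function can return another function, which can then be stored and called like any other object.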

> a <- function(n) function(a) runif(a)
> b <- a(1)
> b(10)
 [1] 0.8726873 0.9512367 0.5971435 0.5540743 0.6378967 0.4030071 0.2750673 0.1777123 0.6960378 0.3969920

This can be useful when you want to create many different kinds of functions:

> a <- list()
> b <- function(i){ i; function() runif(i)}
> for (i in 1:10) a[[i]] <- b(i)
> a[[1]]()
[1] 0.2617396
> a[[2]]()
[1] 0.8822248 0.3374574
> a[[3]]()
[1] 0.0348156 0.4212788 0.6107646

Higher-order functions

You can use higher-order functions in R. Contrary to common belief, using them instead of loops is not faster, because the apply function has a for loop inside its definition. Use them only to improve the clarity of your code.^[2]

apply

apply is the most basic of R's map functions. lapply, sapply and mapply are related functions that work on lists, vectors and multiple vectors respectively.

apply takes as arguments an array, a vector of the dimension to map along and a function. The following example is based on the apply documentation. It uses apply to compute column and row sums of a matrix.

x <- matrix(round(rnorm(100)),10,10)
col.sums <- apply(x, 2, sum)
row.sums <- apply(x, 1, sum)
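With a small fixed matrix the results are easy to check by hand. This sketch also compares apply() with rowSums() and colSums(), which give the same values.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

apply(m, 1, sum)   # one value per row
# [1]  9 12
apply(m, 2, sum)   # one value per column
# [1]  3  7 11

# rowSums() and colSums() give the same values
all(apply(m, 1, sum) == rowSums(m))
# [1] TRUE
```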

tapply

tapply is similar to apply, but applies a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.

> x1 <- rnorm(10)
> x2 <- sample(1:2, 10, replace = T)
> cbind(x1,x2)
              x1 x2
 [1,] -1.7905021  1
 [2,]  1.2908169  2
 [3,] -2.1902513  2
 [4,]  0.4845488  1
 [5,]  0.2281593  1
 [6,]  0.2201302  1
 [7,]  2.1574243  1
 [8,]  0.5789705  2
 [9,]  1.3315188  1
[10,] -1.0029822  2
> tapply(x1, x2, sum)
        1         2 
 2.631279 -1.323446

Reduce

This function, taken from the Reduce documentation, computes the cumulative sum of a vector:

> cadd <- function(x) Reduce("+", x, accumulate = TRUE)
> cadd(1:10)
 [1]  1  3  6 10 15 21 28 36 45 55
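Reduce() works with any binary function, and base R provides the related functions Filter() and Map(). The following sketch uses small integer vectors so that the results can be checked by hand.

```r
# Reduce() folds a binary function over a vector from left to right
Reduce(`*`, 1:5)                       # ((((1*2)*3)*4)*5)
# [1] 120
Reduce(`*`, 1:5, accumulate = TRUE)    # running products
# [1]   1   2   6  24 120

# Filter() and Map() are related higher-order functions in base R
Filter(function(v) v %% 2 == 0, 1:10)  # keeps the even values
# [1]  2  4  6  8 10
Map(function(a, b) a + b, 1:3, 4:6)    # returns a list containing 5, 7 and 9
```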

References

1. http://r.789695.n4.nabble.com/Optional-data-argument-for-a-function-tp850247p850247.html
2. Patrick Burns, The R Inferno, p. 24

Debugging

Some basic tips

Use print() statements in your functions to print variable values. Although this technique is considered low-tech or even old fashioned by some, it can still be a quick and easy way to trace an error.
Place a browser() statement in the function just before the crashing line. When the function is called, it will be executed up to the browser() line. The command-line interface then switches to the function environment, so that all variables in the function can be inspected or changed. See below for commands available in browser() mode.

Tracing errors with `traceback()`

A standard error message in R will tell you which function threw the error. Consider as an example the following function whose sole purpose is to throw an error.

myFun <- function(){
    stop("Woops! An error")
}

A call to myFun() gives

> myFun()
Error in myFun() : Woops! An error

After an error is raised, the traceback() function allows you to show the call stack leading to the error. For example, the function below calls myFun.

myFun2 <- function(){
    myFun()   
}

Calling myFun2() and traceback() gives

> myFun2()
Error in myFun() : Woops! An error
> traceback()
3: stop("Woops! An error")
2: myFun()
1: myFun2()

The traceback() function can be executed automatically each time an error is raised with the option

options(error=traceback)

It may be switched off again with

options(error=NULL)

Executing code line by line

A function can be executed by setting it to debugging mode with

debug(FUNCTION_NAME)

When the function is then called, a browser is opened in that function's environment so that it can be executed line by line. In the debugging browser, apart from all standard R functionality, the following commands are available.

Command	Meaning
`n`	Advance to next step. An empty line also works.
`c`, `cont`	Continue to the end of the current context, e.g. to the end of the loop when within a loop, or to the end of the function.
`where`	Print the stack of function calls (where are you?)
`Q`	Exit the browser and return to the top-level R prompt.

Debugging can be switched off with

undebug(FUNCTION_NAME)

There are a few related functions as well:

debugonce() switches debugging off again after the first call.
isdebugged() checks whether a function is in debugging mode.
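The debugging flag can be set, checked and removed without calling the function. The sketch below uses a toy function f defined only for this purpose.

```r
f <- function(x) x + 1   # toy function for this sketch

debug(f)                 # flag f for debugging
flag_on <- isdebugged(f)
flag_on
# [1] TRUE

undebug(f)               # remove the flag
flag_off <- isdebugged(f)
flag_off
# [1] FALSE
```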

Browsing the call stack

This is the most advanced debugging option in R base. By setting options(error=recover) you get the opportunity to browse any environment in the call stack. For example,

> options(error=recover)
> myFun2()
Error in myFun() : Woops! An error

Enter a frame number, or 0 to exit   

1: myFun2()
2: myFun()

Selection:

By typing '1' or '2' behind Selection: the browser will jump to the selected environment. Once in the browser, all standard R functionality is at your disposal, as well as the commands in the table below.

Command	Meaning
`c`, `cont`	Exit the browser and continue at the next statement. An empty line will do the same.
`n`	Enter the step-through debugger (this changes the meaning of `c`)
`where`	Print a stack trace of active function calls (where are you in the stack?).
`Q`	Exit the browser; do not continue at the next statement but go back to the top-level R prompt.

Recovery mode can be switched off by

options(error=NULL)

Using C or Fortran

For some tasks, R can be slow. In that case, it is possible to write a program in C or Fortran and to use it from R. This page is for advanced programmers only.

See wikiversity Connecting Fortran and R
Link C with R


Utilities

This page includes material about some utilities. Most of the functions presented here have nothing to do with statistical analysis but may be useful when working on a project. Many of them are similar to standard Unix commands.

System (Unix/DOS)

system() gives access to the operating system shell (DOS or Unix). The option wait=FALSE means that R does not wait for the task to finish before continuing.

Some examples :

You can convert an image from PS to PNG using the convert command of your computer (convert is typically provided by ImageMagick, which must be installed). If you want to know more about this command, open a Terminal application and type man convert (this should work on Mac OS and Linux).
You can open Stata and run a program.
You can run pdflatex from R and directly open the pdf in a pdf browser.

system("convert W:/toto.ps W:/toto.png") # converts toto.ps to toto.png
system("D:/Stata10/stata.exe do D:/pgm.do", wait = F) # opens Stata and run pgm.do
system("pdflatex.exe -shell-escape file.tex") # runs pdflatex
system("open file.pdf") # opens the pdf
system("open M:/.../doc/*.pdf") # opens all the pdf in a directory

See also sys() in the Hmisc package, shell() and shell.exec().

File Handling

dir() lists all the files in a directory. It is similar to the Unix function ls. dir.create() creates a new directory. It is similar to mkdir in Unix.

file.info() gives information about a file.

> file.info("taille.txt")
           size isdir mode               mtime               ctime               atime exe
taille.txt  444 FALSE  666 2009-06-26 12:25:44 2009-06-26 12:25:43 2009-06-26 12:25:43  no

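Removing files with a specific pattern (the pattern is a regular expression, and full.names=TRUE is needed so that file.remove() receives complete paths):

file.remove(dir(path="directoryname", pattern="\\.log$", full.names=TRUE))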

file.edit() opens a file in the text editor.
file.show() opens a file in a new window.
tempfile() returns a name which can be used for a temporary file.

getZip() in the Hmisc package.
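The following sketch combines several of these functions on a temporary file, so that nothing is left behind on disk. The file contents are made up for the example.

```r
tmp <- tempfile(fileext = ".txt")   # a path in the session's temporary directory
writeLines(c("first line", "second line"), tmp)

existed  <- file.exists(tmp)        # TRUE once the file has been written
size     <- file.info(tmp)$size     # size in bytes
contents <- readLines(tmp)
contents
# [1] "first line"  "second line"

file.remove(tmp)                    # delete the file again
file.exists(tmp)
# [1] FALSE
```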

Internet

browseURL() opens a URL in an internet browser. download.file() downloads a file from the internet.

> browseURL("http://en.wikibooks.org/wiki/R_Programming")

To see the default browser, use getOption()

getOption("browser")

We can change the default browser using the options() command. It is safer to store the current options first.

oldoptions <- options() # save the options
options(browser = "D:/FramafoxPortable/FramafoxPortable.exe")

You can download a file from the internet using download.file(). Note that very often you don't need to download a file from the internet and you can directly load it into R from the internet using standard functions. For instance, if you want to read a text file from the internet, you can use read.table(), scan() or readLines().

# For example, we download "http://en.wikibooks.org/wiki/R_Programming/Text_Processing" on our Desktop
download.file(url="http://en.wikibooks.org/wiki/R_Programming/Text_Processing",destfile= "~/Desktop/test_processing.html")
# You can also read it into R using readLines()
text <- readLines("http://en.wikibooks.org/wiki/R_Programming/Text_Processing")

Computing time

If you perform a computer-intensive task, you may want to optimize the computing time. Two functions are available: system.time() and proc.time(). Both return a vector of values. The first three values are the user CPU time, the system CPU time and the elapsed time.

> system.time(x<-rnorm(10^6))
[1] 1.14 0.07 1.83 0.00 0.00

> debut <- proc.time()
> x <- rnorm(10^6)
> proc.time()-debut
[1]  1.66  0.10 10.32  0.00  0.00
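The values returned by system.time() are named, so a single component can be extracted by name. A minimal sketch (the timings themselves depend on the machine):

```r
timing <- system.time(x <- rnorm(10^6))

names(timing)[1:3]
# [1] "user.self" "sys.self"  "elapsed"
timing["elapsed"]   # wall-clock time in seconds
```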

Computing process

user.prompt() (Zelig) pauses the computation process (useful if you want to do a demo). waitReturn() (cwhmisc) does the same job. Sys.sleep() suspends the computation for a given number of seconds.

> user.prompt()

Press <return> to continue: 
> Sys.sleep(5)

It is possible to stop the computing process if a logical condition is not true using stopifnot().
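As a sketch, stopifnot() can be used at the top of a function to check its inputs. The helper safe_log() is hypothetical and exists only for this example.

```r
# Hypothetical helper that requires a positive numeric input
safe_log <- function(x) {
  stopifnot(is.numeric(x), all(x > 0))   # raises an error if any condition is not TRUE
  log(x)
}

safe_log(c(1, 10, 100))   # all conditions hold, so the function continues
# safe_log(-1) would stop with an error because all(x > 0) is FALSE
```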

Miscellaneous

trCopy() (TinnR package) copies an object to the clipboard. It is useful if you want to copy a large object, for instance the code of a function, and paste it in a text editor.

> trCopy(lm)
[1] TRUE

sessionInfo() gives information on the current session (R version and loaded packages). This function may be useful for reproducible computing. getRversion() gives the current R version. R.version gives more details about the computer and R, and R.Version() returns the same information as a list.


Estimation utilities

This page deals with methods which are available for most estimation commands. They can be useful for all kinds of regression models.

Formulas

Most estimation commands use a formula interface. The outcome is left of the ~ and the covariates are on the right.

y ~ x1 + x2

It is easy to include categorical variables as predictors in a model. If the variable is not already a factor, one just needs to use the as.factor() function. This will create a set of dummy variables.

y ~ as.factor(x)

For instance, we can use the Star data in the Ecdat package :

library("Ecdat")
data(Star)
summary(lm(tmathssk ~ as.factor(classk), data = Star))
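To see which dummy variables as.factor() generates, one can inspect the design matrix with model.matrix(). The sketch below uses a made-up variable g with three categories.

```r
# Made-up categorical variable with three categories
g <- c("low", "mid", "high", "mid", "low", "high")

# One column per level except the first (alphabetical) level,
# which serves as the reference category
mm <- model.matrix(~ as.factor(g))
colnames(mm)
# [1] "(Intercept)"     "as.factor(g)low" "as.factor(g)mid"
```

Here "high" is the reference category because it comes first in alphabetical order.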

I() takes arguments "as is". For instance, if you want to include in your equation a modified variable such as a squared term or the sum of two variables, you may use I().

lm(y ~ x1 + I(x1^2) + x2)
lm(y ~ I(x1 + x2))
lm(I(y-100) ~ I(x1-100) + I(x2 - 100))

It is easy to include interactions between variables by using : or *. : adds only the interaction terms whereas * adds both the interaction terms and the individual terms.

lm(y~x1:x2) # interaction term only
lm(y~x1*x2) # interaction and individual terms
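The difference between : and * can be checked by looking at the columns of the design matrix. This is a sketch with simulated data; the seed is arbitrary.

```r
set.seed(42)   # arbitrary seed, for reproducibility only
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))

cols_colon <- colnames(model.matrix(~ x1:x2, data = d))
cols_star  <- colnames(model.matrix(~ x1*x2, data = d))

cols_colon
# [1] "(Intercept)" "x1:x2"
cols_star
# [1] "(Intercept)" "x1"          "x2"          "x1:x2"
```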

It is also possible to generate polynomials using the poly() function with option raw = TRUE.

lm(y ~ poly(x, degree = 3, raw = TRUE))

There is also an advanced formula interface which is useful for instrumental variables models and mixed models. For instance ivreg() (AER) uses this advanced formulas interface. The instrumental variables are entered after the |. See the Instrumental Variables section if you want to learn more.

library("AER")
ivreg(y ~ x | z)

Output

In addition to the summary() and print() functions which display the output for most estimation commands, some authors have developed simplified output functions. One of them is the display() function in the arm package. Another one is the coefplot() function in the arm package, which displays the coefficients with confidence intervals in a plot. Following the standards defined by Nathaniel Beck^[1], Jeff Gill developed graph.summary()^[2]. This command does not show useless auxiliary statistics.

In each example below, the R code is followed by its output.

source("http://artsci.wustl.edu/~jgill/Models/graph.summary.R")
N <- 1000
u <- rnorm(N)
x1 <- 1 + rnorm(N)
x2 <- 1 + rnorm(N) + x1
y <- 1 + x1 + x2 + u
graph.summary(lm(y ~ x1 + x2))

Family: gaussian
Link function: identity

             Coef Std.Err. 0.95 Lower 0.95 Upper CIs:ZE+RO
(Intercept) 0.980    0.056      0.871      1.089      |o| 
x1          1.040    0.043      0.955      1.125      |o| 
x2          0.984    0.031      0.923      1.045      |o| 

N: 1000    Estimate of Sigma: 0.998

library("arm")
display(lm(y ~ x1 + x2))

lm(formula = y ~ x1 + x2)
            coef.est coef.se
(Intercept) 0.89     0.05   
x1          1.05     0.04   
x2          1.02     0.03   
---
n = 1000, k = 3
residual sd = 0.96, R-Squared = 0.86

Weights

Tests

Confidence intervals

Delta Method

If you want to know the standard error of a transformation of one of your parameters, you need to use the delta method. Several packages implement it:
deltamethod() in the msm package^[3].
delta.method() in the alr3 package.
deltaMethod() in the car package.
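To give an idea of what these functions compute, here is a hand-written sketch of the first-order delta method in base R. It approximates the standard error of a transformation g(b) by sqrt(grad' V grad), where grad is the gradient of g and V the covariance matrix of the estimates. The simulated data, the seed and the choice of the ratio b1/b2 as the transformation are all assumptions made for this example; it is not the implementation used by the packages above.

```r
set.seed(123)   # arbitrary seed, for reproducibility only
N  <- 200
x1 <- rnorm(N)
x2 <- rnorm(N)
y  <- 1 + 2 * x1 + 4 * x2 + rnorm(N)
fit <- lm(y ~ x1 + x2)

b <- coef(fit)[c("x1", "x2")]
V <- vcov(fit)[c("x1", "x2"), c("x1", "x2")]

# Transformation of interest: g(b) = b1 / b2
g_hat <- unname(b["x1"] / b["x2"])

# Gradient of g with respect to (b1, b2)
grad <- c(1 / b["x2"], -b["x1"] / b["x2"]^2)

# First-order delta-method standard error: sqrt(grad' V grad)
se_g <- sqrt(drop(t(grad) %*% V %*% grad))

c(estimate = g_hat, se = se_g)
```

The estimate should be close to 2/4 = 0.5, the ratio of the true coefficients used in the simulation.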

Zelig : the pseudo-bootstrap method

Zelig^[4] is a postestimation package which simulates from the distribution of the estimated parameters and computes quantities of interest such as marginal effects or predicted probabilities. This is especially useful for non-linear models. Zelig comes with a set of vignettes which explain how to deal with each kind of model. There are three commands.

zelig() estimates the model and draws from the distribution of estimated parameters.
setx() fixes the values of explanatory variables.
sim() computes the quantities of interest.

References

1. Nathaniel Beck "Making regression and related output more helpful to users" The Political Methodologist 2010 http://politics.as.nyu.edu/docs/IO/2576/beck_tpm_edited.pdf
2. Jeff Gill graph.summary() http://artsci.wustl.edu/~jgill/Models/graph.summary.s
3. See the example on the UCLA Statistics webpage : http://www.ats.ucla.edu/stat/r/faq/deltamethod.htm
4. Kosuke Imai, Gary King and Olivia Lau (2009). Zelig: Everyone's Statistical Software. R package version 3.4-5. http://CRAN.R-project.org/package=Zelig

Packages

An R package includes a set of functions and datasets. Packages are often developed as supplementary material to books. For instance, the MASS package was developed by Venables and Ripley for their book Modern Applied Statistics with S and the car package was developed by John Fox for his book An R and S-Plus Companion to Applied Regression.

Load a package

A package is loaded into the current R environment using the library() function. A list of functions and datasets included in a package can be obtained by using the h or help argument of the library function.

library("stats4") # loads the package "stats4"
library(h=stats4) # gives help for all functions
data(package="stats4") # gives the list of all available datasets

A package can be detached from the current environment by using the detach() function:

> detach("package:prettyR")

Without any arguments, the library() function lists all of the packages currently available to the user. env() (gdata) describes all loaded environments (i.e. packages). search() gives the list of all loaded packages.

> library() # returns the description of all the packages available on the computer
> dir(.libPaths()) # returns the name of all the packages available on the computer (quicker than the previous one)
> search()
> env(unit="MB")

current.packages() (Zelig) shows all the required and suggested packages.

> current.packages("sem")

Where are my packages stored?

The .libPaths() function without arguments prints the library directories.
The .libPaths() function with a directory as argument adds a new directory in which to store new libraries.

> .libPaths()
[1] "/Users/username/Library/R/library"
[2] "/Library/Frameworks/R.framework/Resources/library"
> .libPaths("W:/AppData/R/library")

Install new packages

Each major distribution of R includes a 'base' set of packages which support many basic statistical functions.
Many R users also choose to install additional 'Add-on' packages to provide simplified interfaces to R commands or to add specialist functionality, e.g. the ggplot2 Grammar of Graphics package provides an advanced graphical output capability.
The exhaustive list of all available packages is on the CRAN website.
The R community has developed a vast resource of Add-on packages, some with unique functionality, some with overlapping functionality. It is therefore common to find multiple R packages capable of completing the same task, e.g. reading and writing Excel spreadsheets. Ultimately which package to use is your choice.
To install a new package, it is usually sufficient to specify the name of the package as an argument of the install.packages() function.
Sometimes you need to specify more options. For instance, this is the case if you are not an administrator of your computer.
- "lib" specifies the directory where you want to store the package.
- "repos" specifies a list of repositories. Note that you can specify a vector of repositories.
- "dep=T" specifies that all the required packages are also downloaded and installed.

> install.packages("faraway")
> install.packages("rgrs", lib="W:/AppData/R/library" , 
repos=c("http://r-forge.r-project.org","http://cran.fr.r-project.org/"), 
dep=TRUE)
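A common pattern is to install a package only when it is missing. The sketch below uses a hypothetical helper, is_installed(), built on requireNamespace(); the install.packages() call is commented out so that running the example does not trigger a download.

```r
# Hypothetical helper: TRUE if the package can be loaded from the current library paths
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")    # stats ships with R
# [1] TRUE

# Install only when the package is missing (commented out so that
# running this sketch does not trigger a download)
# if (!is_installed("faraway")) install.packages("faraway")
```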

Stay up to date.

If you want to be aware of the latest packages, type new.packages() in R or visit the Revolution Computing Blog which gives each month a list of the new and the updated packages.

> new.packages() # lists the packages available in the repositories which are not yet installed
> update.packages() # updates all the packages installed with the newest version available in the repositories

We can also install bundles of packages using install.views() or update.views() (ctv).

> install.packages("ctv")
> library("ctv")
> install.views("Econometrics")
> update.views("Econometrics")

We can also remove packages with remove.packages().

Package Documentation and Help

All R packages install with 'help' documentation, listing their functions and providing syntax and usage examples.

> library("tidyr") # load the tidyr package
> help("tidyr")    # view the tidyr package's help documentation

See the Obtaining Help Documentation section for more details on accessing package 'help' documentation.

Package Dependencies

Most R packages have dependencies or references to other R packages. You must have all of an R package's 'required' dependencies installed, before you can use the package.
R package dependencies come in two types, required and suggested.
Specialist R packages such as the ggplot2 Grammar of Graphics package have large package dependency trees.
The install.packages() function will automatically download and install a package and its dependencies, on a computer with an Internet connection.
The R CMD INSTALL utility will check preinstalled packages for dependencies, but not download missing packages.
Users must follow separate package download and installation processes when working on a computer with no Internet connection. The miniCRAN package can be used to assist in the offline management of R package dependencies.

Building R Packages

You can write your own R packages. However, all packages submitted to CRAN (or Bioconductor) must follow specific guidelines, including the folder structure of the package and required files such as DESCRIPTION and NAMESPACE.

See Friedrich Leisch's introduction (PDF 20 pages)^[1]
See also Duncan Murdoch's tools for building packages using Windows^[2]
See also Hadley Wickham and Jennifer Bryan's online book on current packaging practices (R Packages) ^[3]

References

1. Friedrich Leisch Creating R Packages : A Tutorial http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf
2. http://www.r-project.org/conferences/useR-2008/slides/Murdoch.pdf
3. Hadley Wickham and Jennifer Bryan R Packages : Organize, Test, Document and Share your code https://r-pkgs.org/


Data types

Vectors are the simplest R objects: an ordered collection of primitive R objects of a given type (e.g. real numbers, strings, logicals). Vectors are indexed by integers starting at 1.
Factors are similar to vectors but each element is categorical, i.e. one of a fixed number of possibilities (or levels).
A matrix is like a vector but with a specific instruction for the layout such that the elements are indexed by two integers, each starting at 1.
Arrays are similar to matrices but can have more than 2 dimensions.
A list is similar to a vector, but the elements need not all be of the same type. The elements of a list can be indexed either by integers or by names, i.e. an R list can be used to implement what is known in other languages as an "associative array", "hash table", "map" or "dictionary".
A dataframe is like a matrix but does not assume that all columns have the same type. A dataframe is a list of variables/vectors of the same length.
Classes define what objects of a certain type look like. Classes are attached to objects as an attribute.

All R objects have a class and a type. Matrices, arrays and dataframes also have a dimension; dim() returns NULL for objects without one.

> class(object)
> typeof(object)
> dim(object)
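As a short sketch, these functions can be applied to a vector, a matrix and a dataframe. The objects v, m and d are made up for the example.

```r
v <- c(1.5, 2, 3)                               # a numeric vector
m <- matrix(1:6, nrow = 2)                      # an integer matrix
d <- data.frame(a = 1:3, b = c("x", "y", "z"))  # a dataframe

typeof(v)   # "double"
dim(v)      # NULL: a plain vector has no dim attribute
typeof(m)   # "integer"
dim(m)      # 2 3
class(d)    # "data.frame"
typeof(d)   # "list": a dataframe is stored as a list of columns
dim(d)      # 3 2
```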

Vectors

You can create a vector using the c() function which concatenates some elements. You can create a sequence using the : symbol or the seq() function. For instance 1:5 gives all the integers between 1 and 5. The seq() function lets you specify the interval between the successive numbers. You can also repeat a pattern using the rep() function. Finally, you can create a numeric vector of zeros using numeric(), a character vector of empty strings using character() and a logical vector of FALSE values using logical().

> c(1,2,3,4,5)
[1] 1 2 3 4 5
> c("a","b","c","d","e")
[1] "a" "b" "c" "d" "e"
> c(T,F,T,F)
[1]  TRUE FALSE  TRUE FALSE

> 1:5
[1] 1 2 3 4 5
> 5:1
[1] 5 4 3 2 1
> seq(1,5)
[1] 1 2 3 4 5
> seq(1,5,by=.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> rep(1,5)
[1] 1 1 1 1 1
> rep(1:2,5)
 [1] 1 2 1 2 1 2 1 2 1 2
> numeric(5)
[1] 0 0 0 0 0
> logical(5)
[1] FALSE FALSE FALSE FALSE FALSE
> character(5)
[1] "" "" "" "" ""
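seq() and rep() accept further arguments. The sketch below shows length.out for seq(), and each and times for rep().

```r
s  <- seq(0, 1, length.out = 5)          # five equally spaced values
s
# [1] 0.00 0.25 0.50 0.75 1.00
r1 <- rep(1:2, each = 3)                 # repeat each element in turn
r1
# [1] 1 1 1 2 2 2
r2 <- rep(c("a", "b"), times = c(2, 3))  # a different count for each element
r2
# [1] "a" "a" "b" "b" "b"
```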

The length() function computes the length of a vector. last() (sfsmisc) returns the last element of a vector but this can also be achieved simply without the need for an extra package.

> x <- seq(1,5,by=.5)    # Create a sequence of numbers
> x                      # Display this object
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> length(x)              # Get length of object x
[1] 9
> library(sfsmisc)
> last(x)                # Select the last element of x
[1] 5
> x[length(x)]           # Select the last element without an extra package.
[1] 5

Factors

factor() transforms a vector into a factor. A factor can also be ordered with the option ordered=T or the function ordered(). levels() returns the levels of a factor. gl() generates factors: n is the number of levels, k the number of repetitions of each level and length the total length of the factor. labels is optional and gives labels to each level.

Factors can be most easily thought of as categorical variables. An important function for summarizing factors is table(), which counts the observations in each level. When considering the types of statistical data (nominal, ordinal, interval and ratio), factors can be nominal, ordinal or interval. Nominal factors are categorical names, for example country names paired with some other information. An example of an ordinal factor would be the finishing places (first, second, ...) of a particular athlete in a set of races; the example with ordered() below shows how to specify the order of the levels yourself. Finally, an example of interval-level factors would be age brackets such as "20 - 29", "30 - 39", etc. In general, R can automatically order numbers stored as factors appropriately, but a programmer may use the same techniques with this type of data to order the levels in the manner most appropriate to their application.

See also is.factor(), as.factor(), is.ordered() and as.ordered().

 
> factor(c("yes","no","yes","maybe","maybe","no","maybe","no","no"))
[1] yes   no    yes   maybe maybe no    maybe no    no   
Levels: maybe no yes
> 
> factor(c("yes","no","yes","maybe","maybe","no","maybe","no","no"), ordered = T)
[1] yes   no    yes   maybe maybe no    maybe no    no   
Levels: maybe < no < yes
> 
> ordered(c("yes","no","yes","maybe","maybe","no","maybe","no","no"))
[1] yes   no    yes   maybe maybe no    maybe no    no   
Levels: maybe < no < yes
>
> ordered(as.factor(c("First","Third","Second","Fifth","First","First","Third")),
+ levels = c("First","Second","Third","Fourth","Fifth"))
[1] First  Third  Second Fifth  First  First  Third 
Levels: First < Second < Third < Fourth < Fifth
>
>  gl(n=2, k=2, length=10, labels = c("Male", "Female")) # generate factor levels
 [1] Male   Male   Female Female Male   Male   Female Female Male   Male  
Levels: Male Female
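The table() function mentioned above can be sketched as follows. The level order given to factor() is an arbitrary choice for this example; it determines the order of the counts.

```r
answers <- factor(c("yes", "no", "yes", "maybe", "maybe", "no", "maybe", "no", "no"),
                  levels = c("no", "maybe", "yes"))   # explicit level order

levels(answers)
# [1] "no"    "maybe" "yes"
counts <- table(answers)     # number of observations in each level
counts
# answers
#    no maybe   yes
#     4     3     2
as.integer(answers)[1:3]     # underlying integer codes for "yes", "no", "yes"
# [1] 3 1 3
```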

Matrix

If you want to create a new matrix, one way is to use the matrix() function. You have to enter a vector of data, the number of rows and/or columns and finally you can specify if you want R to read your vector by row or by column (the default option). Here are two examples. In the second one, the 15 values are recycled to fill the 25 cells.

> matrix(data = NA, nrow = 5, ncol = 5, byrow = T)
     [,1] [,2] [,3] [,4] [,5]
[1,]   NA   NA   NA   NA   NA
[2,]   NA   NA   NA   NA   NA
[3,]   NA   NA   NA   NA   NA
[4,]   NA   NA   NA   NA   NA
[5,]   NA   NA   NA   NA   NA

> matrix(data = 1:15, nrow = 5, ncol = 5, byrow = T)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]    1    2    3    4    5
[5,]    6    7    8    9   10
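The effect of byrow is easiest to see by filling the same small matrix both ways, as in this sketch.

```r
m_col <- matrix(1:6, nrow = 2)                 # filled column by column (default)
m_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

m_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row
m_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```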

Functions cbind() and rbind() combine vectors into matrices in a column by column or row by row mode:

> v1 <- 1:5
> v2 <- 5:1
> v2
[1] 5 4 3 2 1
> cbind(v1,v2)
     v1 v2
[1,]  1  5
[2,]  2  4
[3,]  3  3
[4,]  4  2
[5,]  5  1

> rbind(v1,v2)
   [,1] [,2] [,3] [,4] [,5]
v1    1    2    3    4    5
v2    5    4    3    2    1

The dimension of a matrix can be obtained using the dim() function. Alternatively nrow() and ncol() return the number of rows and columns in a matrix:

> X <- matrix(data = 1:15, nrow = 5, ncol = 5, byrow = T)
> dim(X)
[1] 5 5
> nrow(X)
[1] 5
> ncol(X)
[1] 5

Function t() transposes a matrix:

> t(X)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11    1    6
[2,]    2    7   12    2    7
[3,]    3    8   13    3    8
[4,]    4    9   14    4    9
[5,]    5   10   15    5   10

Unlike the columns of a data frame, all elements of a matrix must be of the same type. Adding character values to a numeric matrix converts every element to character:

> a=matrix(2,2,2)
> a
     [,1] [,2]
[1,]    2    2
[2,]    2    2
> a = rbind(a,c("A","A"))
> a
     [,1] [,2]
[1,] "2"  "2" 
[2,] "2"  "2" 
[3,] "A"  "A"
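
A matrix is not limited to numbers and strings; it can hold any single atomic type, and R promotes the whole matrix to the most general type present. The sketch below uses typeof() to follow this promotion step by step; the object names are illustrative.

```r
m <- matrix(c(TRUE, FALSE, TRUE, FALSE), nrow = 2)  # a logical matrix
typeof(m)                           # "logical"

m_num <- rbind(m, c(2, 3))          # adding numbers promotes logical to double
typeof(m_num)                       # "double"

m_chr <- rbind(m_num, c("A", "B"))  # adding strings promotes everything to character
typeof(m_chr)                       # "character"
dim(m_chr)                          # 4 rows, 2 columns
```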

Arrays

An array has n dimensions, and all of its elements are of the same type. A one-dimensional array with a single element may be constructed as follows.

> x <- array(c(T,F),dim=c(1))
> print(x)
[1] TRUE

The array x was created with a single dimension of length one (dim=c(1)) and filled from the vector c(T,F); only the first value is used because the array has room for one element. A similar array, y, can be created with a single dimension and two values.

> y <- array(c(T,F),dim=c(2))
> print(y)
[1]  TRUE FALSE

A three dimensional array - 3 by 3 by 3 - may be created as follows.

> z <- array(1:27,dim=c(3,3,3))
> dim(z)
[1] 3 3 3
> print(z)
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

, , 3

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27

R arrays are accessed in a manner similar to arrays in other languages: by integer index, starting at 1 (not 0). The following code shows how to extract the third slice along the third dimension of the 3 by 3 by 3 array. The result is a 3 by 3 matrix.

> z[,,3]
     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27

Specifying two of the three indices returns a one-dimensional vector.

> z[,3,3]
[1] 25 26 27

Specifying all three indices returns a single element of the 3 by 3 by 3 array.

> z[3,3,3]
[1] 27

More complex subsets can be selected by passing a vector of indices for a dimension.

> z[,c(2,3),c(2,3)]
, , 1

     [,1] [,2]
[1,]   13   16
[2,]   14   17
[3,]   15   18

, , 2

     [,1] [,2]
[1,]   22   25
[2,]   23   26
[3,]   24   27
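
In the subsetting examples above, dimensions of length one are dropped from the result. The following sketch shows the drop = FALSE argument of the subsetting operator, which keeps them; this matters when later code expects an object with three dimensions.

```r
z <- array(1:27, dim = c(3, 3, 3))

v <- z[, 3, 3]                 # dimensions of length one are dropped
dim(v)                         # NULL: a plain vector
s <- z[, 3, 3, drop = FALSE]   # keep all three dimensions
dim(s)                         # 3 1 1
```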

Arrays need not have the same length in every dimension. The following code creates a 3 by 3 by 2 array, that is, a pair of 3 by 3 matrices.

> w <- array(1:18,dim=c(3,3,2))
> print(w)
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

All the elements of an array must be of the same type, but they need not be numbers.

> u <- array(c(T,F),dim=c(3,3,2))
> print(u)
, , 1

      [,1]  [,2]  [,3]
[1,]  TRUE FALSE  TRUE
[2,] FALSE  TRUE FALSE
[3,]  TRUE FALSE  TRUE

, , 2

      [,1]  [,2]  [,3]
[1,] FALSE  TRUE FALSE
[2,]  TRUE FALSE  TRUE
[3,] FALSE  TRUE FALSE

Lists

A list is a collection of R objects. list() creates a list. unlist() transforms a list into a vector. The objects in a list do not have to be of the same type or length.

> x <- c(1:4)
> y <- FALSE
> z <- matrix(c(1:4),nrow=2,ncol=2)
> myList <- list(x,y,z)
> myList
[[1]]
[1] 1 2 3 4

[[2]]
[1] FALSE

[[3]]
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Lists have very flexible methods for reference.

By index number:

> a <- list()
> a
list()
> a[[1]] = "A"
> a
[[1]]
[1] "A"

> a[[2]]="B"
> a
[[1]]
[1] "A"

[[2]]
[1] "B"

By name:

> a <- list()
> a
list()
> a$fruit = "Apple"
> a
$fruit
[1] "Apple"

> a$color = "green"
> a
$fruit
[1] "Apple"

$color
[1] "green"

Both methods can be combined, and lists can be nested:

> a = list()
> a[[1]] = "house"
> a$park = "green's park"
> a
[[1]]
[1] "house"

$park
[1] "green's park"


> a$park = "green's park"
> a[[1]]$address = "1 main st."

> a
[[1]]
[[1]][[1]]
[1] "house"

[[1]]$address
[1] "1 main st."


$park
[1] "green's park"

Because a list can be indexed by a character string, one can also create and name list elements dynamically:

> a <- list()
> n <- 1:10
> fruit = paste("number of coconuts in bin", n)
> my.number = paste("I have", 10:1, "coconuts")
> for (i in 1:10) a[fruit[i]] = my.number[i]
> a$'number of coconuts in bin 7'
[1] "I have 4 coconuts"
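
The difference between single and double brackets, and the effect of unlist(), can be sketched as follows. The list fruit and its element names are made up for this illustration.

```r
fruit <- list(name = "Apple", counts = c(3, 5, 8))

sub_list <- fruit["counts"]    # single brackets: a list of length 1
element  <- fruit[["counts"]]  # double brackets: the numeric vector itself
class(sub_list)                # "list"
class(element)                 # "numeric"

flat <- unlist(fruit)          # everything is coerced to one character vector
flat                           # names: name, counts1, counts2, counts3
```

Note that unlist() has to pick a single type for the result, so the numbers become strings here.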

Data Frames

A data frame has been referred to as "a list of variables/vectors of the same length". In the following example, a data frame of two vectors is created, each of five elements. The first vector, v1, is the sequence of integers 1 through 5. The second vector, v2, is composed of five logical values (T and F). The data frame is then created from the two vectors. The columns of the data frame can be accessed using integer subscripts or the column name and the $ symbol.

> v1 <- 1:5
> v2 <- c(T,T,F,F,T)
> df <- data.frame(v1,v2)
> print(df)
  v1    v2
1  1  TRUE
2  2  TRUE
3  3 FALSE
4  4 FALSE
5  5  TRUE
> df[,1]
[1] 1 2 3 4 5
> df$v2
[1]  TRUE  TRUE FALSE FALSE  TRUE

The data frame may also be created directly, naming each column in the argument list of data.frame().

> df <- data.frame(foo=1:5,bar=c(T,T,F,F,T))
> print(df)
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
5   5  TRUE
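
A few other common ways of looking into a data frame are sketched below, using the same foo and bar columns. The object names only_true and col_types are illustrative.

```r
df <- data.frame(foo = 1:5, bar = c(TRUE, TRUE, FALSE, FALSE, TRUE))

df[["foo"]]                     # a column as a vector, same as df$foo
only_true <- df[df$bar, ]       # only the rows where bar is TRUE
col_types <- sapply(df, class)  # the class of each column
col_types
```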

External links

data.frame objects in R (a sample chapter from the “R in Action” book)
Aggregation and Restructuring of data.frame objects (a sample chapter from the “R in Action” book)

Working with data frames

In this section, we deal with methods to read, manage and clean-up a data frame.

In R, a dataframe is a list of vectors of the same length. They don't have to be of the same type. For instance, you can combine in one dataframe a logical, a character and a numerical vector.

Reading and saving data

If data are already in an R format (.Rda or .Rdata), you can load them in memory using load(). You can save data to the R format using save().

load("mydata.Rda")
save(list='mydata',file="mydata.Rda")
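
A complete round trip might look like the sketch below. It writes to a temporary file created with tempfile() so that nothing in the working directory is touched; the data frame mydata is made up for the example.

```r
mydata <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))
path <- tempfile(fileext = ".Rda")   # temporary file for the demonstration

save(mydata, file = path)            # the object is stored under its own name
rm(mydata)                           # remove it from the workspace
loaded <- load(path)                 # load() returns the names of the restored objects
loaded                               # "mydata"
```

Because load() restores objects under the names they were saved with, it can silently overwrite an object of the same name in the workspace.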

Example Datasets

Most packages include example datasets to test the functions.
The data() function without argument gives the list of all example datasets in all the loaded packages.
If you want to load them in memory, you just need to use the data function and include the name of the dataset as an argument.
str_data() (sfsmisc) gives the structure of all datasets in a package.

> data() # lists all the datasets in all the packages in memory
> data(package="datasets") # lists all the datasets in the "datasets" package
> data(Orange) # loads the orange dataset in memory
> ?Orange # Help for the "Orange" Datasets
> str_data("datasets") # gives the structure of all the datasets in the datasets package.

Some packages include lots of datasets.
- The datasets package
- The AER package ^[1] includes replication datasets for some important textbooks in econometrics.
- The Ecdat package ^[2] includes replication archive for the Journal of Applied Econometrics, the Journal of Business and Economic Statistics, etc.

Building your own data frames

You can create a dataframe using vectors.

N <- 100
u <- rnorm(N)
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 1 + x1 + x2 + u
mydat <- data.frame(y,x1,x2)

R has a spreadsheet-style data editor. One can use it to enter the data into a spreadsheet.

mydat <- edit(data.frame())

You can also read a table from the clipboard:

> mydat <- read.table("clipboard")

You can also read space delimited tables in your code using gsource() (Zelig). Here is an example with Yule 1899 data.^[3]

mydat <- gsource(var.names = "id union pauperism out old  pop", 
variables = "
1 Kensington 27 5 104 136
2 Paddington  47 12 115 111
3 Fulham 31 21 85 174
")

You can change the column names of a data frame.

c1 <- c('A','B','C')
c2 <- c('Alpha','Bravo','Charlie')
c3 <- c('1','2','3')
mydf <- data.frame(c1,c2,c3)
colnames(mydf) <- c('ColName1','ColName2','ColName3')

Describing a data frame

There are various ways to inspect a data frame, such as:

str(df) gives a very brief description of the data
names(df) gives the name of each variable
summary(df) gives some very basic summary statistics for each variable
head(df) shows the first few rows
tail(df) shows the last few rows.

Browsing data

You can browse your data in a spreadsheet using View(). Depending on your operating system, this option is not always available and the result is not always the same.
You can print the first lines using head() and the last lines using tail().

View(mydata)
head(mydata, n = 20) # n = 20 means  that the first 20 lines are printed in the R console

RStudio has a nice data browser (View(mydata)).
RKWard also has a nice data browser.
Paul Murrell is currently developing the rdataviewer package (pdf).

Binding row or column

When working with data frames, you often need to add a column or a row, which increases the dimension of the data frame. There are a few different ways to do it, but the easiest ones are cbind() and rbind(), which are part of the base package:

mydata <- cbind(mydata, newVector)
mydata <- rbind(mydata, newVector)

Remember that the length of newVector should match the length of the side of the data frame that you are attaching it to. For example, for the cbind() command the following statement should be TRUE:

dim(mydata)[1]==length(newVector)

For more examples, see ?base::cbind and ?base::rbind.
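
The following sketch adds one column and one row to a small made-up data frame. For the new row it uses a one-row data frame rather than a plain vector; this is a choice made here, because a plain vector can hold only one type and would not suit a data frame with mixed column types.

```r
mydata <- data.frame(id = 1:3, x = c(10, 20, 30))

# New column: its length must equal the number of rows
group <- c("a", "b", "a")
mydata <- cbind(mydata, group)

# New row: a one-row data frame with the same column names
new_row <- data.frame(id = 4L, x = 40, group = "b")
mydata <- rbind(mydata, new_row)
dim(mydata)   # 4 rows, 3 columns
```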

Attaching data

One of the big advantages of R over Stata is that you can deal with multiple datasets at the same time. You just need to specify the name of the dataset and a "$" symbol before each variable name (for instance mydat1$var1 and mydat2$var1). If you only work with one dataset and you don't want to write the name of the dataset again and again as a prefix for each variable, you can use attach().

mydata$var1
attach(mydata)
var1
detach(mydata)
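
An alternative that avoids attaching the data is the base function with(), which evaluates a single expression using the columns of a data frame. It is not part of the example above and is sketched here with made-up data.

```r
mydata <- data.frame(var1 = c(1, 2, 3), var2 = c(4, 5, 6))

# Evaluate an expression with the columns of mydata, without attaching it
total <- with(mydata, sum(var1 * var2))
total   # 1*4 + 2*5 + 3*6 = 32
```

Since nothing is attached, there is no detach() to forget and no risk of a column masking another object of the same name.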

Detecting duplicates

When you want to clean up a data set, it is very often useful to check whether the same information appears twice in the data. R provides some functions to detect duplicates.

duplicated() looks for duplicated elements and returns a logical vector. You can use table() to summarize this vector.
Duplicated() (sfsmisc) generalizes this command. Duplicated() only marks unique values with "NA".
remove.dup.rows() (cwhmisc) removes duplicated rows.
unique() keeps only the unique lines in a dataset.
distinct() (dplyr) retains only unique/distinct rows from a dataset.

library("Zelig")
mydat <- gsource(
variables = "
1 1 1 1
1 1 1 1
1 2 3 4
1 2 3 4
1 2 2 2
1 2 3 2")
unique(mydat) # keep unique rows
library(cwhmisc)
remove.dup.rows(mydat) # similar to unique()
table(duplicated(mydat)) # table duplicated lines
mydat$dups <- duplicated(mydat) # add a logical variable for duplicates
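
The same checks can be run with base R only, without the Zelig and cwhmisc packages. The sketch below builds a small data frame by hand; the column names v1, v2 and v3 are arbitrary.

```r
mydat <- data.frame(v1 = c(1, 1, 1, 1), v2 = c(1, 1, 2, 2), v3 = c(1, 1, 3, 4))

dup <- duplicated(mydat)    # TRUE for a row identical to an earlier row
dup                         # FALSE TRUE FALSE FALSE
table(dup)                  # counts of duplicated and non-duplicated rows
clean <- mydat[!dup, ]      # same rows as unique(mydat)
```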

Creating and removing variables

To create a new variable, assign a vector to a new column name:

mydata$newvar <- oldvar

If you want to delete a variable in a dataset, you can assign NULL to that variable :

# Delete the x variable in the df data frame.
df$x <- NULL
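
Both operations are sketched together below on a made-up data frame. The transform() call is an addition to what the text describes: it is a base function that lets the new column refer to existing columns by name.

```r
mydata <- data.frame(x = 1:4, y = c(2, 4, 6, 8))

mydata$ratio <- mydata$y / mydata$x          # new column from existing ones
mydata <- transform(mydata, total = x + y)   # columns referred to by name
mydata$y <- NULL                             # delete a column
names(mydata)                                # "x" "ratio" "total"
```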

Renaming variables

It is possible to rename a variable by redefining the vector of names of a data frame.
There is also a rename() function in the reshape package.

df <- data.frame(x = 1:10, y = 21:30)
names(df)
names(df) <- c("toto","tata")
names(df)
names(df)[2] <- "titi"
names(df)

Creating a subset of the data

One can subset the data using subset(). The first argument is the name of the dataset, the second argument is a logical condition which says which lines will be included in the new dataset, and the last argument is the list of variables which will be included in the new dataset.

In the following example, we generate a fake dataset and we use the subset() command to select the lines and columns of interest. We choose the lines such that x1 > 0 and x2 < 0 and we only keep x1 and x2 as variables.

N <- 100
x1 <- rnorm(N)
x2 <- 1 + rnorm(N) + x1
x3 <- rnorm(N) + x2
mydat <- data.frame(x1,x2,x3)
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x1,x2))
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = - x3) # the same.

It is also possible to reorder the columns using the select option.

subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x1,x2))
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x2,x1))

Sorting and ordering

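order() returns the permutation of row indices which sorts the data by one or more variables. Unless the data frame is attached, the variables must be referenced through the data frame:

mydat[order(mydat$var1, mydat$var2),]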

Suppose you want to randomize the order of the rows in a data set. You just need to generate a vector from a uniform distribution and order the rows according to that vector.

df[order(runif(nrow(df))),]
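
A small worked example of both operations is sketched below. The data are made up, the minus sign is one way to sort a numeric variable in decreasing order, and sample() is offered as an alternative to order(runif(...)) for shuffling.

```r
mydat <- data.frame(var1 = c(2, 1, 2, 1), var2 = c(10, 40, 30, 20))

# Sort by var1 ascending, then by var2 descending within ties
sorted <- mydat[order(mydat$var1, -mydat$var2), ]
sorted$var2   # 40 20 30 10

# Shuffle the rows
set.seed(1)   # arbitrary seed, only to make the shuffle repeatable
shuffled <- mydat[sample(nrow(mydat)), ]
```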

Detecting missing values

is.na() returns a logical object of the same shape as its argument, equal to TRUE where a value is missing and to FALSE otherwise.
complete.cases() returns a logical vector with one element per row, equal to TRUE if the row has no missing values and to FALSE otherwise.

> table(complete.cases(df))
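
The sketch below applies these functions to a small made-up data frame with one missing value in each column. na.omit() is a further base function, not mentioned above, which drops the incomplete rows.

```r
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

is.na(df)                          # logical matrix, one cell per value
n_missing <- colSums(is.na(df))    # number of missing values per column
ok <- complete.cases(df)           # TRUE for rows without missing values
ok                                 # TRUE FALSE FALSE
complete <- na.omit(df)            # keep only the complete rows
```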

Reshaping a dataframe

This topic is important if you deal with panel data. Panel data can be stored in a wide format with one observation per unit and a variable for each time period or in a long format with one observation per unit and time period. reshape() reshapes a dataset in a wide or long format.

> country <- c("Angola","UK","France")
> gdp.1960 <- c(1,2,3)
> gdp.1970 <- c(2,4,6)
> mydat <- data.frame(country,gdp.1960,gdp.1970)
> mydat # wide format
  country gdp.1960 gdp.1970
1  Angola       1       2
2      UK       2       4
3  France       3       6
> reshape( data = mydat, varying = list(2:3) , v.names = "gdp", direction = "long") # long format
    country time gdp id
1.1  Angola    1   1  1
2.1      UK    1   2  2
3.1  France    1   3  3
1.2  Angola    2   2  1
2.2      UK    2   4  2
3.2  France    2   6  3

varying gives the numbers (or names) of the columns which are time-varying
v.names gives the name of the variable which stores the time-varying values in the long format
direction gives the direction, either "long" or "wide".
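
The reverse operation, from long to wide, can be sketched with the same function. The example below also uses the timevar, times and idvar arguments of reshape(), which the text above does not cover, so that the time variable holds the actual years rather than 1 and 2.

```r
mydat <- data.frame(country = c("Angola", "UK", "France"),
                    gdp.1960 = c(1, 2, 3),
                    gdp.1970 = c(2, 4, 6))

# Wide to long, with a "year" variable holding 1960 and 1970
long <- reshape(mydat, varying = list(2:3), v.names = "gdp",
                timevar = "year", times = c(1960, 1970),
                idvar = "country", direction = "long")

# Long to wide: one gdp column per value of year
wide <- reshape(long, v.names = "gdp", timevar = "year",
                idvar = "country", direction = "wide")
names(wide)
```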

See also :
- reShape() (Hmisc)
- See Hadley Wickham's reshape package^[4]
- See Duncan Murdoch's tables package ^[5]

External links

Printing nested tables in R – bridging between the {reshape} and {tables} packages

Expanding a dataset

Sometimes we need to duplicate some lines in a dataset, for instance if we want to generate a fake dataset with a panel data structure. In that case, we would first generate time-invariant variables and then duplicate each line a given number of times in order to create time-varying variables.

It is possible to use the expand() function in the epicalc package (since this package does not exist anymore, an option to expand is given in [1]). This will multiply each line by a given number.

N <- 1000
T <- 5
wide <- data.frame(id = 1:N,f = rnorm(N),  rep = T)
library("epicalc")
long <- expand(wide,index.var = "rep")
long$time <- rep(1:T,N)

We can also use a do-it-yourself solution and create our own function. The idea is simple. We create a vector which gives for each line the number of times it should be replicated (dups in the following example). Then we use the rep() function to create a vector which repeats the line numbers according to what we want. The last step creates a new dataset which repeats lines according to the desired pattern.

expand <- function(df,dups){
	df$dups <- dups
	pattern <- rep(1:nrow(df), times=df$dups)
	df2 <- df[pattern,]
	index <- function(x){
		1:length(x)
		}
	df2$year <- unlist(tapply(df2$dups, df2$id, index))
	df2$dups <- NULL 
	return(df2)
	}

df <- data.frame(x = rnorm(3), id = 1:3)
dups = c(3,1,2)
expand(df,dups)
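
The core of the idea fits in two lines. The sketch below is a shorter variant of the function above, not a replacement for it: it uses the base function sequence() to number the copies of each row, and fixed values instead of rnorm() so that the result is easy to check.

```r
df <- data.frame(id = 1:3, x = c(0.5, 1.5, 2.5))
dups <- c(3, 1, 2)                       # how many copies of each row

# Repeat each row index the requested number of times, then subset
long <- df[rep(seq_len(nrow(df)), times = dups), ]
long$year <- sequence(dups)              # 1,2,3 for id 1; 1 for id 2; 1,2 for id 3
rownames(long) <- NULL                   # drop the 1, 1.1, 1.2, ... row names
long
```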

Merging dataframes

Merging data can be very confusing, especially in the case of multiple merges. Here is a simple example :

We have one table describing authors :

> authors <- data.frame(
+     surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+     nationality = c("US", "Australia", "US", "UK", "Australia"),
+     deceased = c("yes", rep("no", 4)))
> authors
   surname nationality deceased
1    Tukey          US      yes
2 Venables   Australia       no
3  Tierney          US       no
4   Ripley          UK       no
5   McNeil   Australia       no

and one table describing books

> books <- data.frame(
+     name = I(c("Tukey", "Venables", "Tierney",
+              "Ripley", "Ripley", "McNeil", "R Core")),
+     title = c("Exploratory Data Analysis",
+               "Modern Applied Statistics ...",
+               "LISP-STAT",
+               "Spatial Statistics", "Stochastic Simulation",
+               "Interactive Data Analysis",
+               "An Introduction to R"),
+     other.author = c(NA, "Ripley", NA, NA, NA, NA,
+                      "Venables & Smith"))
> books
      name                         title     other.author
1    Tukey     Exploratory Data Analysis             <NA>
2 Venables Modern Applied Statistics ...           Ripley
3  Tierney                     LISP-STAT             <NA>
4   Ripley            Spatial Statistics             <NA>
5   Ripley         Stochastic Simulation             <NA>
6   McNeil     Interactive Data Analysis             <NA>
7   R Core          An Introduction to R Venables & Smith

We want to merge tables books and authors by author's name ("surname" in the first dataset and "name" in the second one). We use the merge() command. We specify the name of the first and the second datasets, then by.x and by.y specify the identifier in both datasets. all.x and all.y specify whether we want to keep all the observations of the first and the second dataset. In this case we want to have all the observations from the books dataset, but we only keep the observations from the authors dataset which match an observation in the books dataset.

> final <- merge(books, authors, by.x = "name", by.y = "surname", sort=F,all.x=T,all.y=F)
> final
      name                         title     other.author nationality deceased
1    Tukey     Exploratory Data Analysis             <NA>          US      yes
2 Venables Modern Applied Statistics ...           Ripley   Australia       no
3  Tierney                     LISP-STAT             <NA>          US       no
4   Ripley            Spatial Statistics             <NA>          UK       no
5   Ripley         Stochastic Simulation             <NA>          UK       no
6   McNeil     Interactive Data Analysis             <NA>   Australia       no
7   R Core          An Introduction to R Venables & Smith        <NA>     <NA>
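
The effect of the all, all.x and all.y options is easiest to see by counting rows. The sketch below uses shortened versions of the two tables; the object names inner, left and full are labels chosen here, borrowed from database terminology.

```r
authors <- data.frame(surname = c("Tukey", "Ripley", "McNeil"),
                      nationality = c("US", "UK", "Australia"))
books <- data.frame(name = c("Tukey", "Ripley", "Ripley", "R Core"),
                    title = c("Exploratory Data Analysis", "Spatial Statistics",
                              "Stochastic Simulation", "An Introduction to R"))

# Matched rows only (the default)
inner <- merge(books, authors, by.x = "name", by.y = "surname")
# Every book, with NA for unknown authors
left  <- merge(books, authors, by.x = "name", by.y = "surname", all.x = TRUE)
# Every book and every author
full  <- merge(books, authors, by.x = "name", by.y = "surname", all = TRUE)

c(inner = nrow(inner), left = nrow(left), full = nrow(full))   # 3 4 5
```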

It is also possible to merge two data.frame objects while preserving the rows’ order by one of the two merged objects.^[6]

Resources

R Data Manual.^[7]
Paul Murrell's Introduction to Data Technologies.^[8]

References

↑ The AER Package http://cran.r-project.org/web/packages/AER/index.html
↑ The Ecdat Package http://cran.r-project.org/web/packages/Ecdat/index.html
↑ "An investigation into the causes of changes in pauperism in England, chiefly during the last two intercensal decades (Part I.)" - GU Yule - Journal of the Royal Statistical Society, June 1899, p 283
↑ Reshaping Data with the reshape Package : http://www.jstatsoft.org/v21/i12
↑ vignette for the tables package: http://cran.r-project.org/web/packages/tables/vignettes/tables.pdf
↑ Merging data frames while preserving the rows
↑ R Data Manual http://cran.r-project.org/doc/manuals/R-data.html
↑ Paul Murrell introduction to Data Technologies http://www.stat.auckland.ac.nz/~paul/ItDT/

Importing and exporting data

Data can be stored in a large variety of formats. Each statistical package has its own format for data (xls for Microsoft Excel, dta for Stata, sas7bdat for SAS, ...). R can read almost all file formats. We present a method for each kind of file. If none of the following methods work, you can use dedicated data-conversion software such as the free software OpenRefine or the commercial software Stat Transfer.^[1] In any case, most statistical software can export data in CSV (comma-separated values) format and all of them can read CSV data. This is often the best solution to make data available to everyone.

Graphical user interfaces

Some IDEs and GUIs provide point-and-click tools to import data.

You may also have a look at speedR, a graphical user interface which helps with importing data from Excel, OpenOffice Calc, CSV and other text files.^[2]

library(speedR)
speedR()

CSV (csv,txt,dat)

You can import data from a text file (often CSV) using read.table(), read.csv() or read.csv2(). The option header = TRUE indicates that the first line of the CSV file should be interpreted as variable names and the option sep = gives the separator (generally "," or ";").

csv.get() (Hmisc) is another possibility.

mydata <- read.table("data.txt",header=TRUE)
mydata <- read.table("data.csv", header = TRUE, sep=",")  # import from a CSV
mydata <- read.csv("data.csv", header=T)
mydata <- read.table("data.csv", header = TRUE, sep=";") 
mydata <- read.csv2("data.csv", header=T)

Note that there is no problem if your data are stored on the internet.

df <- read.table("http://www.mywebsite.com/.../data.csv", header = TRUE, sep = ",")

In R versions before 4.0.0, strings are converted to factors by default. If you want to avoid this conversion, you can specify the option stringsAsFactors = FALSE, which is the default since R 4.0.0.

You can export data to a text file using write.table().

write.table(mydat,file="mydat.csv",quote=T,append=F,sep=",",eol = "\n", na = "NA", dec = ".", row.names = T,col.names = T)
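
A write and read round trip can be sketched as follows. It uses write.csv(), the counterpart of read.csv(), and a temporary file so that the working directory is left alone; the data frame is made up.

```r
mydat <- data.frame(id = 1:3, name = c("a", "b", "c"), value = c(1.5, NA, 3),
                    stringsAsFactors = FALSE)
path <- tempfile(fileext = ".csv")                 # temporary file for the demonstration

write.csv(mydat, file = path, row.names = FALSE)   # no extra column of row names
back <- read.csv(path, stringsAsFactors = FALSE)   # keep strings as character vectors

str(back)
all.equal(mydat, back)                             # TRUE if the round trip preserved the data
```

Setting row.names = FALSE matters here: otherwise the row names are written as an additional first column and come back as a variable when the file is read.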

For large CSV files, it is possible to use the ff package.^[3]

library("ff")
df <- read.csv.ffdf(file="large_csv_file.csv", header=TRUE, VERBOSE=TRUE, first.rows=10000, next.rows=50000)

Fixed width text files

Fixed width text files can be read with read.fwf() and written with write.fwf() (gdata).

Some fixed width text files are provided with a SAS script to import them. Anthony Damico has created SAScii package to easily import those data.^[4]
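
As an illustration of read.fwf(), the sketch below writes a two-line file with a made-up layout (3 characters for an id, 5 for a name, 4 for a score) and reads it back. The file content and column names are invented for the example.

```r
path <- tempfile(fileext = ".txt")
# Each line: 3 characters of id, 5 of name, 4 of score
writeLines(c("001Alice 9.5",
             "002Bob  10.0"), path)

fwf <- read.fwf(path, widths = c(3, 5, 4),
                col.names = c("id", "name", "score"),
                strip.white = TRUE)   # remove the padding around the fields
fwf
```

The leading zeros of the id are lost because the column is read as a number; the colClasses argument, which read.fwf() passes on to read.table(), can be used to read it as character instead.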

Unstructured text files

See scan() and readLines() in the Reading and writing text files section.

Stata (dta)

We can read Stata data using read.dta() in the foreign package and export to Stata data format using write.dta().
Note that string variables in Stata are limited to 244 characters. This can be an issue when exporting data.
See also Stata.file() in the memisc package and stata.get() in the Hmisc package.

> library("foreign")
> mydata <- read.dta("mydata.dta",convert.dates = TRUE, convert.factors = TRUE, convert.underscore = TRUE)
> names(mydata)
> write.dta(mydata, file = "mydata.dta")

SAS (sas7bdat)

Experimental support for SAS databases having the sas7bdat extension is provided by the sas7bdat^[5] package. However, sas7bdat files generated by 64 bit versions of SAS, and SAS running on non-Microsoft Windows platforms are not yet supported.

SAS (xpt)

See also sasexport.get() and sas.get() in the Hmisc package.
See also the SASxport package.

library("foreign")
mydata<-read.xport("SASData.xpt")
names(mydata)

SPSS (sav)

read.spss() (foreign) and spss.get() (Hmisc)

> library("foreign")
> mydata<-read.spss("SPSSData.sav")
> names(mydata)

EViews

readEViews() in the hexView package for EViews files.

Excel (xls,xlsx)

Importing data from Excel is not easy. The solution depends on your operating system. If none of the methods below works, you can always export each Excel spreadsheet to CSV format and read the CSV in R. This is often the simplest and quickest solution.

XLConnect supports reading and writing both xls and xlsx file formats. Since it is based on Apache POI it only requires a Java installation and as such works on many platforms including Windows, UNIX/Linux and Mac. Besides reading & writing data it provides a number of additional features such as adding plots, cell styling & style actions and many more.

require("XLConnect")
wb <- loadWorkbook("myfile.xls", create = FALSE)
# Show a summary of the workbook (shows worksheets,
# defined names, hidden sheets, active sheet name, ...)
summary(wb)
# Read data from a worksheet interpreting the first row as column names
df1 <- readWorksheet(wb, sheet = "mysheet")
# Read data from a named region/range interpreting the first row as column
# names
df2 <- readNamedRegion(wb, name = "myname", header = TRUE)

The RODBC solution:

library("RODBC")
# 32-bit Windows:
channel <- odbcConnectExcel("Graphiques pourcent croissance.xls") # creates a connection
# 64-bit Windows (use this line instead of the previous one):
# channel <- odbcConnectExcel2007("Graphiques pourcent croissance.xls")
sqlTables(channel) # List all the tables
effec <- sqlFetch(channel, "effec") # Read one spreadsheet as an R table
odbcClose(channel) # close the connection (don't forget)

The xlsReadWrite package (this package is no longer available on CRAN, but old versions can be downloaded from the CRAN archive).

library(xlsReadWrite)
mydat <- read.xls("myfile.xls", colNames = T, sheet = "mysheet", type = "data.frame", from = 1, checkNames = TRUE)

"sheet" specifies the name or the number of the sheet you want to import.
"from" specifies the first row of the spreadsheet.

The gnumeric package.^[6] This package uses an external program called ssconvert which is usually installed with gnumeric, the Gnome office spreadsheet. The read.gnumeric.sheet() function reads xls and xlsx files.

library("gnumeric")
df1 <- read.gnumeric.sheet(file = "df.xls", head = TRUE, sheet.name = "Feuille1")
df2 <- read.gnumeric.sheet(file = "df.xlsx", head = TRUE, sheet.name = "Feuille1")

See also xlsx for Excel 2007 documents and read.xls() (gdata).

Google Spread Sheets

You should make the spreadsheet public and publish it as a CSV file. Then you can read it in R using read.csv(). See more on the Revolution's computing blog. See also RGoogleDocs.

# Read from a Google SpreadSheet.
require(RCurl)
myCsv <- getURL("https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AkuuKBh0jM2TdGppUFFxcEdoUklCQlJhM2kweGpoUUE&single=true&gid=0&output=csv")
read.csv(textConnection(myCsv))

gnumeric spreadsheets

The gnumeric package^[6]. read.gnumeric.sheet() reads one sheet and read.gnumeric.sheets() reads all sheets and stores them in a list.

library("gnumeric")
df <- read.gnumeric.sheet(file = "df.gnumeric", head = TRUE, sheet.name = "df.csv")
View(df)
df <- read.gnumeric.sheets(file = "df.gnumeric", head = TRUE)
View(df$df.csv)

OpenOffice and LibreOffice (ods)

readODS does not require external dependencies, making it cross-platform.

library("readODS")
df=read.ods("df.ods")

speedR is another alternative.

library("speedR")
df <- speedR.importany(file = "df.ods")

Note that you can also use the speedR graphical user interface (speedR()) which will return the command line for replication.

library("speedR")
speedR()

JSON

JSON (JavaScript Object Notation) is a very common format on the internet. The rjson package makes it easy to import data from a JSON format^[7].

# json.txt : a text file including data in the JSON format
library("rjson")
df <- fromJSON(paste(readLines("json.txt"), collapse=""))

It is easy to export a list or a data frame to a JSON format using the toJSON() function :

# df : a data frame
library("rjson")
json <- toJSON(df)

Sometimes the JSON data can be more complex with structures such as nested arrays. In this case you may find it more useful to use an online converter like json-csv.com to convert the file to CSV. Then import the resulting data as per the CSV instructions above.

dBase (dbf)

read.dbf() in the foreign package.

library("foreign")
df  <- read.dbf("file.dbf")
str(df)

Hierarchical Data Format (hdf5)

hdf5 data can be read using the hdf5 package^[8].

SQL

DICOM and NIfTI

See "Working with the {DICOM} and {NIfTI} Data Standards in R" in the Journal of Statistical Software^[9]

Resources

R Data Manual^[10].
Paul Murrell's Introduction to Data Technologies^[11].

References

↑ Stat Transfer
↑ speedR
↑ "Opening Large CSV Files in R". Retrieved March 7, 2013.
↑ David Smith. "Importing public data with SAS instructions into R". Revolution Analytics. Retrieved February 1, 2013.
↑ sas7bdat
↑ ^a ^b This command has been tested using Ubuntu 10.10 and R 2.11.1
↑ http://cran.r-project.org/web/packages/rjson/index.html
↑ http://cran.r-project.org/web/packages/hdf5/index.html
↑ Brandon Whitcher, Volker J. Schmid, Andrew Thorton "Working with the {DICOM} and {NIfTI} Data Standards in R", Journal of Statistical Software Vol. 44, Issue 6, Oct 2011, link
↑ R Data Manual
↑ Paul Murrell introduction to Data Technologies

Text Processing

This page includes all the material you need to deal with strings in R. The section on regular expressions may be useful to understand the rest of the page, even if it is not necessary if you only need to perform some simple tasks.

This page may be useful to :

perform statistical text analysis.
collect data from an unformatted text file.
deal with character variables.

In this page, we learn how to read a text file and how to use R functions for characters. There are two kinds of functions for characters: simple functions and regular expressions. Many functions are part of the standard R base package.

help.search(keyword = "character", package = "base")

However, their names and syntax are not intuitive to all users. Hadley Wickham has developed the stringr package, which defines functions with similar behaviour but with names that are easier to remember and a much more systematic syntax^[1].

Keywords : text mining, natural language processing
See CRAN Task view on Natural Language Processing^[2]
See also the following packages tm, tau, languageR, scrapeR.

Reading and writing text files

R can read any text file using readLines() or scan(). It is possible to specify the encoding of the imported text file with readLines(). The entire contents of the text file can be read into an R object (e.g., a character vector). scan() is more flexible. The kind of data expected can be specified in the second argument (e.g., character(0) for a string).

text <- readLines("file.txt",encoding="UTF-8")
scan("file.txt", character(0)) # separate each word
scan("file.txt", character(0), quote = NULL) # do not treat quote characters specially
scan("file.txt", character(0), sep = ".") # separate each sentence
scan("file.txt", character(0), sep = "\n") # separate each line

We can write the content of an R object into a text file using cat() or writeLines(). By default cat() concatenates vectors when writing to the text file. You can change it by adding options sep="\n" or fill=TRUE. The default encoding depends on your computer.

cat(text,file="file.txt",sep="\n")
writeLines(text, con = "file.txt", sep = "\n", useBytes = FALSE)
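
Reading and writing can be checked together with a round trip through a temporary file. The sketch below uses made-up text; quiet = TRUE only suppresses the message that scan() prints about the number of items read.

```r
path <- tempfile(fileext = ".txt")
text <- c("First line.", "Second line, with two sentences. Here is the other.")

writeLines(text, con = path)                               # one element per line
by_line <- readLines(path)                                 # one element per line
by_word <- scan(path, what = character(0), quiet = TRUE)   # split on white space

length(by_line)   # 2
length(by_word)   # 11
```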

Before reading a text file, you can look at its properties. nlines() (parser package) and countLines() (R.utils package) count the number of lines in the file. count.chars() (parser package) counts the number of bytes and characters in each line of a file. You can also display a text file using file.show().

Character encoding

R provides functions to deal with various encoding schemes. This is useful if you deal with text files which have been created on another operating system, especially if the language is not English and has many accents and special characters. For instance, the standard encoding scheme in Linux is "UTF-8" whereas the standard encoding scheme in Windows is "Latin1". The Encoding() function returns the encoding of a string. iconv() is similar to the Unix command iconv and converts the encoding.

iconvlist() gives the list of available encoding schemes on your computer.
readLines(), scan() and file.show() also have an encoding option.
is.utf8() (tau) tests if the encoding is "utf8".
is.locale() (tau) tests if encoding is the same as the default encoding on your computer.
translate() (tau) translates the encoding into the current locale.
fromUTF8() (descr) is less general than iconv().
utf8ToInt() (base)

Example

The following example was run under Windows. Thus, the default encoding is "latin1".

> texte <- "Hé hé"
> Encoding(texte)
[1] "latin1"
> texte2 <-  iconv(texte,"latin1","UTF-8")
> Encoding(texte2)
[1] "UTF-8"
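
The conversion can also be run in the other direction, and its effect is visible in the number of bytes used. The sketch below writes the accented letters with \u escapes so that the result does not depend on the encoding of the script file; it assumes that the latin1 encoding is available to iconv() on your system.

```r
x <- "H\u00e9 h\u00e9"                  # "Hé hé", marked as UTF-8 because of the escapes
Encoding(x)                             # "UTF-8"

y <- iconv(x, from = "UTF-8", to = "latin1")
Encoding(y)                             # "latin1"

nchar(x, type = "bytes")                # 7: each é takes two bytes in UTF-8
nchar(y, type = "bytes")                # 5: each é takes one byte in latin1
nchar(x, type = "chars") == nchar(y, type = "chars")   # same number of characters
```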

Regular Expressions

See also the Wikibook Regular expressions.

A regular expression describes a pattern that strings may match. For instance, one could have the following pattern : 2 digits, 2 letters and 4 digits. R provides powerful functions to deal with regular expressions. Two types of regular expressions are used in R:^[3]

extended regular expressions, used by ‘perl = FALSE’ (the default),
Perl-like regular expressions used by ‘perl = TRUE’.

There is also an option called ‘fixed = TRUE’, which treats the pattern as a literal string rather than a regular expression. fixed() (stringr) is equivalent to fixed=TRUE in the standard regex functions. These functions are case sensitive by default. This can be changed by specifying the option ignore.case = TRUE.

If you are not a specialist in regular expressions, you may find glob2rx() useful. This function converts a "glob" (wildcard) pattern into the corresponding regular expression:

> glob2rx("abc.*")
[1] "^abc\\."
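
As a small sketch of how the result can be used, the regular expression returned by glob2rx() can be passed directly to grepl(). The file names below are made up for the example.

```r
files <- c("data.csv", "notes.txt", "data.csv.bak")  # hypothetical file names
rx <- glob2rx("*.csv")
rx                        # "^.*\\.csv$"
grepl(rx, files)          # TRUE FALSE FALSE
files[grepl(rx, files)]   # "data.csv"
```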

Functions which use regular expressions in R

sub(), gsub(), str_replace() (stringr) make some substitutions in a string.
grep(), str_extract() (stringr) extract some value
grepl(), str_detect() (stringr) detect the presence of a pattern.
see also splitByPattern() (R.utils)
See also gsubfn() in the gsubfn package.

Extended regular expressions (The default)

"." stands for any character.
"[ABC]" means A, B or C.
"[A-Z]" means any uppercase letter between A and Z.
"[0-9]" means any digit between 0 and 9.

Here is the list of metacharacters: ‘$ * + . ? [ ] ^ { } | ( ) \’. If you need to match one of those characters literally, precede it with a doubled backslash, for instance "\\." for a literal dot.

Here are some predefined character classes. They are used inside a bracket expression, for instance "[[:digit:]]". For numbers:

‘[:digit:]’ Digits: ‘0 1 2 3 4 5 6 7 8 9’.

For letters :

‘[:alpha:]’ Alphabetic characters: ‘[:lower:]’ and ‘[:upper:]’.
‘[:upper:]’ Upper-case letters.
‘[:lower:]’ Lower-case letters.

Note that the set of alphabetic characters includes accented letters such as é è ê, which are very common in some languages like French. Therefore, it is more general than "[A-Za-z]", which does not include accented letters.

For other characters :

‘[:punct:]’ Punctuation characters: ‘! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~’.
‘[:space:]’ Space characters: tab, newline, vertical tab, form feed, carriage return, and space.
‘[:blank:]’ Blank characters: space and tab.
‘[:cntrl:]’ Control characters.

For combinations of other classes:

‘[:alnum:]’ Alphanumeric characters: ‘[:alpha:]’ and ‘[:digit:]’.
‘[:graph:]’ Graphical characters: ‘[:alnum:]’ and ‘[:punct:]’.
‘[:print:]’ Printable characters: ‘[:alnum:]’, ‘[:punct:]’ and space.
‘[:xdigit:]’ Hexadecimal digits: ‘0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f’.

You can specify the number of repetitions by adding one of the following quantifiers after the regular expression:

‘?’ The preceding item is optional and will be matched at most once.
‘*’ The preceding item will be matched zero or more times.
‘+’ The preceding item will be matched one or more times.
‘{n}’ The preceding item is matched exactly ‘n’ times.
‘{n,}’ The preceding item is matched ‘n’ or more times.
‘{n,m}’ The preceding item is matched at least ‘n’ times, but not more than ‘m’ times.

^ to force the regular expression to be at the beginning of the string
$ to force the regular expression to be at the end of the string
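
As a rough illustration, the pattern mentioned at the start of this section (2 digits, 2 letters and 4 digits) can be written with classes, quantifiers and anchors. The test strings are made up for the example.

```r
x <- c("12AB3456", "1AB3456", "12AB345", "x12AB3456")
pattern <- "^[[:digit:]]{2}[[:alpha:]]{2}[[:digit:]]{4}$"
grepl(pattern, x)                  # TRUE FALSE FALSE FALSE

# Without the anchors, the pattern may match anywhere inside the string
grepl("[[:digit:]]{2}[[:alpha:]]{2}[[:digit:]]{4}", x)   # TRUE FALSE FALSE TRUE

# A metacharacter such as "." must be escaped to be matched literally
grepl(".", "abc")                  # TRUE: "." matches any character
grepl("\\.", "abc")                # FALSE: there is no literal dot
grepl(".", "abc", fixed = TRUE)    # FALSE: fixed = TRUE disables metacharacters
```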

If you want to know more, have a look at the 2 following help files :

>?regexp # gives some general explanations
>?grep # help file for grep(),regexpr(),sub(),gsub(),etc

Perl-like regular expressions

It is also possible to use "perl-like" regular expressions. You just need to use the option perl=TRUE.

Examples

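If you want to remove space characters in a string, you can use the \\s Perl macro. Note that sub() only removes the first space character; use gsub() to remove all of them.

sub('\\s', '', x, perl = TRUE)   # removes the first space character only
gsub('\\s', '', x, perl = TRUE)  # removes every space character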
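
A few more constructs that are available with perl = TRUE are sketched below on a made-up string. The case-conversion escapes \\U and \\L in the replacement and lookahead assertions are Perl-style features.

```r
x <- "  hello   wORLD  "
# \\s+ matches a run of whitespace; gsub() replaces every run by one space
gsub("\\s+", " ", x, perl = TRUE)                    # " hello wORLD "
# \\U and \\L in the replacement change the case of what follows
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "hello wORLD", perl = TRUE)   # "Hello World"
# Lookahead: match "a" only when it is followed by "b"
gsub("a(?=b)", "X", "ab ac ab", perl = TRUE)         # "Xb ac Xb"
```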

Concatenating strings

paste() concatenates strings.
str_c() (stringr) does a similar job.
cat() prints and concatenates strings.

Examples

> paste("toto","tata",sep=' ')
[1] "toto tata"
> paste("toto","tata",sep=",")
[1] "toto,tata"
> str_c("toto","tata",sep=",")
[1] "toto,tata"
> x <- c("a","b","c")
> paste(x,collapse=" ")
[1] "a b c"
> str_c(x, collapse = " ")
[1] "a b c"
> cat(c("a","b","c"), sep = "+")
a+b+c
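
paste() is vectorized, which the examples above do not show. The short sketch below uses made-up strings, together with the base functions paste0() and sprintf().

```r
paste0("file", 1:3, ".csv")           # "file1.csv" "file2.csv" "file3.csv"
paste("x", c("a", "b"), sep = "_")    # "x_a" "x_b" (the shorter argument is recycled)
# sep joins the arguments element by element, collapse then joins the results
paste(c("a", "b"), c("1", "2"), sep = "-", collapse = "+")   # "a-1+b-2"
sprintf("%s has %d rows", "mydata", 10L)   # "mydata has 10 rows"
```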

Splitting a string

strsplit() : Split the elements of a character vector ‘x’ into substrings according to the matches to substring ‘split’ within them.
See also str_split() (stringr).

> unlist(strsplit("a.b.c", "\\."))
[1] "a" "b" "c"

tokenize() (tau) splits a string into tokens.

> tokenize("abc defghk")
[1] "abc"    " "      "defghk"
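
strsplit() always returns a list, with one element per input string. The following sketch, on made-up strings, shows a literal split, a split on a regular expression and a way to pick one piece from each element.

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]    # "a" "b" "c" (literal dot)
strsplit("a, b;c", "[,;] *")[[1]]            # "a" "b" "c" (comma or semicolon)
dates <- c("2010-02-13", "2012-07-18")
parts <- strsplit(dates, "-")                # a list of two character vectors
sapply(parts, `[`, 1)                        # "2010" "2012"
```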

Counting the number of characters in a string

nchar() gives the length of a string. Note that for non-ASCII encodings, there is more than one way to measure such a length (for instance in characters or in bytes).
See also str_length() (stringr)

> nchar("abcdef")
[1] 6
> nchar(NA)
[1] NA
> nchar("René")
[1] 4
> nchar("René", type = "bytes")
[1] 5

Detecting the presence of a substring

Detecting a pattern in a string ?

grepl() returns a logical value (TRUE or FALSE).
str_detect() (stringr) does a similar job.

> string <- "23 mai 2000"
> string2 <- "1 mai 2000"
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> grepl(pattern = regexp, x = string)
[1] TRUE
> str_detect(string, regexp)
[1] TRUE
> grepl(pattern = regexp, x = string2)
[1] FALSE

The first one is TRUE and the second one is FALSE, since the day in string2 has only one digit whereas the pattern requires two.

Counting the occurrence of each pattern in a string ?

textcnt() (tau) counts the occurrence of each pattern or each term in a text.

> string <- "blabla 23 mai 2000 blabla 18 mai 2004"
> textcnt(string,n=1L,method="string")
blabla    mai 
     2      2 
attr(,"class")
[1] "textcnt"

Extracting the position of a substring or a pattern in a string

Extracting the position of a substring ?

cpos() (cwhmisc) returns the position of a substring in a string.
substring.location() (cwhmisc) does the same job but returns the first and the last position.

 
> cpos("abcdefghijklmnopqrstuvwxyz","p",start=1)
[1] 16
> substring.location("abcdefghijklmnopqrstuvwxyz","def")
$first
[1] 4

$last
[1] 6

Extracting the position of a pattern in a string ?

regexpr() returns the position of the first match of the regular expression, together with its length. str_locate() (stringr) does the same job. gregexpr() is similar to regexpr() but returns the starting position of every match. str_locate_all() (stringr) does the same job.

> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> string <- "blabla 23 mai 2000 blabla 18 mai 2004"
> regexpr(pattern = regexp, text = string)
[1] 8
attr(,"match.length")
[1] 11
> gregexpr(pattern = regexp, text = string)
[[1]]
[1]  8 27
attr(,"match.length")
[1] 11 11
> str_locate(string,regexp)
     start end
[1,]     8  18
> str_locate_all(string,regexp)
[[1]]
     start end
[1,]     8  18
[2,]    27  37
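
The positions returned by regexpr() and gregexpr() can be turned back into the matched text in base R. The sketch below uses substr() for the first match and regmatches() for all matches, on the same string and pattern as above.

```r
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
string <- "blabla 23 mai 2000 blabla 18 mai 2004"

pos <- regexpr(regexp, string)
start <- as.integer(pos)                        # 8
end <- start + attr(pos, "match.length") - 1    # 18
substr(string, start, end)                      # "23 mai 2000"

# regmatches() extracts the text of every match found by gregexpr()
regmatches(string, gregexpr(regexp, string))[[1]]   # "23 mai 2000" "18 mai 2004"
```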

Extracting a substring from a string

Extracting a fixed width substring ?

substr() extracts a substring between a start and a stop position.
str_sub() (stringr) is similar.

> substr("simple text",1,3)
[1] "sim"
> str_sub("simple text",1,3)
[1] "sim"
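
A few related base idioms, sketched on the same made-up string: substring() is vectorized over the positions and its stop position is optional, and substr() also has a replacement form.

```r
x <- "simple text"
substring(x, 8)                     # "text" (from position 8 to the end)
substring("abcdef", 1:3, 3:5)       # "abc" "bcd" "cde"
substr(x, nchar(x) - 3, nchar(x))   # "text" (the last four characters)
substr(x, 1, 1) <- "S"              # replacement form: overwrite position 1
x                                   # "Simple text"
```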

Extracting the first word in a string ?

first.word() (Hmisc) returns the first word in a string or expression.

> first.word("abc def ghk")
[1] "abc"

Extracting a pattern in a string ?

grep() returns the elements that match the regular expression if value = TRUE and their positions in the vector if value = FALSE.

> string <- "23 mai 2000"
> string2 <- "1 mai 2000"
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> grep(pattern = regexp, x = string, value = TRUE)
[1] "23 mai 2000"
> grep(pattern = regexp, x = string2, value = TRUE)
character(0)
> grep(pattern = regexp, x = string, value = FALSE)
[1] 1
> grep(pattern = regexp, x = string2, value = FALSE)
integer(0)

str_extract(), str_extract_all(), str_match(), str_match_all() (stringr) and m() (caroline package) are similar to grep(). str_extract() and str_extract_all() return a vector. str_match() and str_match_all() return a matrix and m() a dataframe.

> library("stringr")
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> string <- "blabla 23 mai 2000 blabla 18 mai 2004"
> str_extract(string,regexp)
[1] "23 mai 2000"
> str_extract_all(string,regexp)
[[1]]
[1] "23 mai 2000" "18 mai 2004"

> str_match(string,regexp)
     [,1]          [,2] [,3]  [,4]  
[1,] "23 mai 2000" "23" "mai" "2000"
> str_match_all(string,regexp)
[[1]]
     [,1]          [,2] [,3]  [,4]  
[1,] "23 mai 2000" "23" "mai" "2000"
[2,] "18 mai 2004" "18" "mai" "2004"
> library("caroline")
> m(pattern = regexp, vect = string, names = c("day","month","year"), types = rep("character",3))
  day month year
1  18   mai 2004

Named capture regular expressions can be used to define column names in the regular expression (this also serves to document the regular expression). Install the namedCapture package via devtools::install_github("tdhock/namedCapture") to use str_match_all_named(). It uses the base function gregexpr(perl=TRUE) to parse a Perl-Compatible Regular Expression, and returns a list of match matrices with column names:

> named.regexp <- paste0(
+   "(?<day>[[:digit:]]{2})",
+   " ",
+   "(?<month>[[:alpha:]]+)",
+   " ",
+   "(?<year>[[:digit:]]{4})")
> namedCapture::str_match_all_named(string, named.regexp)
[[1]]
     day  month year  
[1,] "23" "mai" "2000"
[2,] "18" "mai" "2004"

Making some substitution inside a string

Substituting a pattern in a string

sub() makes a substitution.
gsub() is similar to sub() but replaces all occurrences of the pattern, whereas sub() only replaces the first occurrence.
str_replace() (stringr) is similar to sub(), str_replace_all() (stringr) is similar to gsub().

In the following example, we have a French date. The pattern is the following: 2 digits, a blank, some letters, a blank, 4 digits. We capture the 2 digits with the [[:digit:]]{2} expression, the letters with [[:alpha:]]+ and the 4 digits with [[:digit:]]{4}. Each of these three subexpressions is surrounded with parentheses. The first captured substring is available as "\\1" in the replacement, the second one as "\\2" and the third one as "\\3".

string <- "23 mai 2000"
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
sub(pattern = regexp, replacement = "\\1", x = string) # returns the first part of the regular expression
sub(pattern = regexp, replacement = "\\2", x = string) # returns the second part
sub(pattern = regexp, replacement = "\\3", x = string) # returns the third part
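
For reference, here is what these calls return, together with a sketch of how the captured groups can be reordered in a single replacement. The target format is chosen only for illustration.

```r
string <- "23 mai 2000"
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
sub(regexp, "\\1", string)            # "23"
sub(regexp, "\\2", string)            # "mai"
sub(regexp, "\\3", string)            # "2000"
# Backreferences can be combined with literal text in the replacement
sub(regexp, "\\3-\\2-\\1", string)    # "2000-mai-23"
```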

In the following example, we compare the outcome of sub() and gsub(). The first one removes the first space whereas the second one removes all spaces in the text.

> text <- "abc def ghk"
> sub(pattern = " ", replacement = "",  x = text)
[1] "abcdef ghk"
> gsub(pattern = " ", replacement = "",  x = text)
[1] "abcdefghk"

Substituting characters in a string ?

chartr() substitutes characters in an expression. It stands for "character translation".
replacechar() (cwhmisc) does the same job ...
as well as str_replace_all() (stringr).

> chartr(old="a",new="o",x="baba")
[1] "bobo"
> chartr(old="ab",new="ot",x="baba")
[1] "toto"
> replacechar("abc.def.ghi.jkl",".","_")
[1] "abc_def_ghi_jkl"
> str_replace_all("abc.def.ghi.jkl","\\.","_")
[1] "abc_def_ghi_jkl"

Converting letters to lower or upper-case

tolower() converts upper-case characters to lower-case.
toupper() converts lower-case characters to upper-case.
capitalize() (Hmisc) capitalizes the first letter of a string.
See also cap(), capitalize(), lower(), lowerize() and CapLeading() in the cwhmisc package.

> tolower("ABCdef")
[1] "abcdef"
> toupper("ABCdef")
[1] "ABCDEF"
> capitalize("abcdef")
[1] "Abcdef"

Filling a string with some character

padding() (cwhmisc) fills a string with some characters to fit a given length. See also str_pad() (stringr).

> library("cwhmisc")
> padding("abc",10," ","center") # adds blanks such that the length of the string is 10.
[1] "   abc    "
> str_pad("abc",width=10,side="center", pad = "+")
[1] "+++abc++++"
> str_pad(c("1","11","111","1111"),3,side="left",pad="0") 
[1] "001"  "011"  "111"  "1111"

Note that str_pad() can be very slow. In the timings below, padding a vector of length 10,000 took a very long time. padding() does not seem to handle character vectors, but the best solution may be to use the sapply() and padding() functions together.

> library("stringr")
> library("cwhmisc")
> a <- rep(1,10^4)
> system.time(b <- str_pad(a,3,side="left",pad="0"))
   user  system elapsed 
 50.968   0.208  73.322 
> system.time(c <- sapply(a, padding, space = 3, with = "0", to = "left"))
   user  system elapsed 
  7.700   0.020  12.206
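
For simple cases, padding can also be done with the vectorized base functions sprintf() and formatC(). This is an alternative to the functions timed above, not a benchmark of them.

```r
# Zero-pad integers on the left to a width of 3
sprintf("%03d", c(1L, 11L, 111L, 1111L))   # "001" "011" "111" "1111"
# Pad strings with blanks
formatC("abc", width = 10)                 # "       abc" (right-justified)
formatC("abc", width = 10, flag = "-")     # "abc       " (left-justified)
# sprintf() handles a whole vector in one call
a <- rep(1L, 10^4)
b <- sprintf("%03d", a)
```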

Removing leading and trailing spaces

trimws() (memisc package; base R also provides a trimws() function since version 3.2.0) trims leading and trailing white spaces.
trim() (gdata package) does the same job.
See also str_trim() (stringr)

> library("memisc")
> trimws("  abc def   ")
[1] "abc def" 
> library("gdata")
> trim(" abc def ")
[1] "abc def"
> str_trim("  abd def  ")
[1] "abd def"
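
The same result can be obtained without any package by using a regular expression. A minimal sketch:

```r
x <- "  abc def   "
# Remove white space at the beginning (^) or at the end ($) of the string
gsub("^[[:space:]]+|[[:space:]]+$", "", x)   # "abc def"
sub("^[[:space:]]+", "", x)                  # "abc def   " (leading only)
sub("[[:space:]]+$", "", x)                  # "  abc def" (trailing only)
```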

Comparing two strings

Assessing if they are identical

== returns TRUE if both strings are the same and FALSE otherwise.

> "abc"=="abc"
[1] TRUE
> "abc"=="abd"
[1] FALSE

Computing distance between strings

A few packages implement the Levenshtein distance between two strings:

adist() in base package utils
stringMatch() in MiscPsycho
stringdist() in stringdist
levenshteinDist() in RecordLinkage

A benchmark comparing the speed of levenshteinDist() and stringdist() is available here: [2].

Example with utils

> adist("test","tester")
[1] 2
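
adist() is vectorized and returns a matrix of distances, with one row per element of the first argument and one column per element of the second. A short sketch with made-up words:

```r
adist("kitten", "sitting")[1, 1]                  # 3
d <- adist("test", c("tester", "text", "best"))   # a 1 x 3 matrix
drop(d)                                           # 2 1 1
# ignore.case = TRUE treats upper and lower case letters as equal
adist("Test", "test", ignore.case = TRUE)[1, 1]   # 0
```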

Example with MiscPsycho

stringMatch() (MiscPsycho) computes the Levenshtein distance between two strings. If normalize="YES", the Levenshtein distance is divided by the length of the longer string.

> library("MiscPsycho")
> stringMatch("test","tester",normalize="NO",penalty=1,case.sensitive = TRUE)
[1] 2

Approximate matching

agrep() searches for approximate matches using the Levenshtein distance.

If 'value = TRUE', this returns the matching strings.
If 'value = FALSE', this returns the positions of the matching strings.
max (short for max.distance) sets the maximal Levenshtein distance allowed for a match.

>  agrep(pattern = "laysy", x = c("1 lazy", "1", "1 LAZY"), max = 2, value = TRUE)
[1] "1 lazy"
>  agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 3, value = TRUE)
[1] "1 lazy"

Miscellaneous

deparse() : Turn unevaluated expressions into character strings.
char.expand() (base) expands a string with respect to a target.
pmatch() (base) and charmatch() (base) seek matches for the elements of their first argument among those of their second.

> pmatch(c("a","b","c","d"),table = c("b","c"), nomatch = 0)
[1] 0 1 2 0

make.unique() makes the elements of a character vector unique by appending a sequence number to duplicates. This is useful if you want to use strings as identifiers in your data.

> make.unique(c("a", "a", "a"))
[1] "a"   "a.1" "a.2"

References

↑ Hadley Wickham "stringr: modern, consistent string processing" The R Journal, December 2010, Vol 2/2, http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf
↑ http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
↑
In former versions (< 2.10) we had also basic regular expressions in R :
- extended regular expressions, used by extended = TRUE (the default),
- basic regular expressions, as used by extended = FALSE (obsolete in R 2.10).
Since basic regular expressions (‘extended = FALSE’) are now obsolete, the extended option is obsolete in version 2.11.

Times and Dates

R contains a set of object types for holding date and time information. The system time and date can also be requested.

Format

Many time and date units are recognised. These include:

Unit	Symbol	Example
4 digit year	%Y	1932
2 digit year	%y	84
Numerical Month	%m	03
Full Month	%B	January
Abbreviated Month	%b	Jan
Day of the month	%d	31
Full weekday	%A	Wednesday
Abbreviated weekday	%a	Wed
Hours (24hr clock)	%H	16
Minutes	%M	35
Seconds	%S	52

The default format is yyyy-mm-dd hh:mm:ss or %Y-%m-%d %H:%M:%S

For example 2010-02-13 23:12:24
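
The symbols in the table are used both for reading dates (the format argument) and for printing them with format(). The sketch below only uses numeric fields, so that the output does not depend on the language settings; the variable names are arbitrary.

```r
d <- as.Date("13/02/2010", format = "%d/%m/%Y")
format(d, "%Y-%m-%d")          # "2010-02-13"
format(d, "%d.%m.%y")          # "13.02.10"
tm <- as.POSIXct("2010-02-13 23:12:24", tz = "UTC")
format(tm, "%H:%M")            # "23:12"
format(tm, "%Y%m%d-%H%M%S")    # "20100213-231224"
```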

System Date and Time

To get the system date and time:

> Sys.time()
 [1] "2010-02-13 23:12:24 COT"
> format(Sys.time(),"%H %M")   # in a different format and without the date
 [1] "23 13"
> Sys.Date()
 [1] "2010-02-13"
> date()                       # returns the current date and time,
[1] "Wed Jul 18 10:59:42 2012"

Convert strings to date/time objects

Convert a string representing the date or time into a Date/Time object:

> my.date <- as.Date("2010-12-20")
> print(my.date)
 [1] "2010-12-20"
> my.date2 <- as.Date("12/20/30", format="%m/%d/%y") # input date in a different format
> print(my.date2)
 [1] "2030-12-20"
> my.time <- strptime("12/20/30 14.34.35", format="%m/%d/%y %H.%M.%S") # input time and date
> print(my.time)
 [1] "2030-12-20 14:34:35"
> my.string <- as.character(Sys.time()) # convert a date/time object to a normal string
> print(my.string)
 [1] "2016-06-30 23:04:44"

Extracting information from dates

Get the weekday, the month and an integer representing the number of days since the beginning of the epoch (1970-01-01):

> weekdays(my.date) # Get a string representing the weekday of the specified date
[1] "Monday"
> months(my.date) # Get the month as well
[1] "December"
> my.date
[1] "2010-12-20"
> julian(my.date) # Get the integer number of days since the beginning of epoch
[1] 14963
attr(,"origin")
[1] "1970-01-01"
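
Because a Date is stored as a number of days, dates support simple arithmetic, and their components can be extracted in a way that does not depend on the language settings. A minimal sketch, with arbitrary variable names:

```r
d1 <- as.Date("2010-12-20")
d2 <- as.Date("2010-12-30")
d2 - d1                  # Time difference of 10 days
as.numeric(d2 - d1)      # 10
d1 + 7                   # "2010-12-27"
as.numeric(d1)           # 14963 days since 1970-01-01

lt <- as.POSIXlt(d1)     # list-like representation with named components
lt$year + 1900           # 2010 (years are counted from 1900)
lt$mon + 1               # 12 (months are counted from 0)
lt$mday                  # 20
lt$wday                  # 1 (0 = Sunday, so 1 = Monday)
```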

Note that weekdays() and months() return results in the local language. For instance, if you switch R to a French locale, you get weekdays and months in French^[1]:

> require("lubridate")
> Sys.setlocale(locale="fr_FR.UTF-8")
[1] "fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8"
> mydate  <- ymd("2002-04-21")
> weekdays(mydate)
[1] "Dimanche"
> months(mydate)
[1] "avril"

Generating sequences of dates

> seq(from = as.Date("01/01/12", "%d/%m/%y"), to = as.Date("10/01/12","%d/%m/%y"), by = "day")
#create the 10 first days of January 2012
 [1] "2012-01-01" "2012-01-02" "2012-01-03" "2012-01-04" "2012-01-05" "2012-01-06"
 [7] "2012-01-07" "2012-01-08" "2012-01-09" "2012-01-10"

> seq(from = as.Date("20/01/12", "%d/%m/%y"), to = as.Date("20/12/12","%d/%m/%y"), by = "month")
#create the 20th of each month in 2012
 [1] "2012-01-20" "2012-02-20" "2012-03-20" "2012-04-20" "2012-05-20" "2012-06-20"
 [7] "2012-07-20" "2012-08-20" "2012-09-20" "2012-10-20" "2012-11-20" "2012-12-20"

> seq(from = as.Date("01/01/12", "%d/%m/%y"), to = as.Date("31/01/12","%d/%m/%y"), length.out = 16)
#create a sequence of every other day in january 2012
 [1] "2012-01-01" "2012-01-03" "2012-01-05" "2012-01-07" "2012-01-09" "2012-01-11"
 [7] "2012-01-13" "2012-01-15" "2012-01-17" "2012-01-19" "2012-01-21" "2012-01-23"
[13] "2012-01-25" "2012-01-27" "2012-01-29" "2012-01-31"
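
The by argument also accepts other units and multiples of a unit. The following sketch uses ISO-formatted dates, which as.Date() reads without a format argument.

```r
# Four consecutive weeks starting on 1 January 2012
seq(as.Date("2012-01-01"), by = "week", length.out = 4)
# "2012-01-01" "2012-01-08" "2012-01-15" "2012-01-22"

# The first day of each quarter of 2012
seq(as.Date("2012-01-01"), by = "quarter", length.out = 4)
# "2012-01-01" "2012-04-01" "2012-07-01" "2012-10-01"

# Every other day in January 2012, using a step instead of length.out
s <- seq(as.Date("2012-01-01"), as.Date("2012-01-31"), by = "2 days")
length(s)   # 16
```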

References

↑ Issue on Stackoverflow

External links

Do more with dates and times in R with lubridate 1.1.0 (a sample chapter from the “R in Action” book)

Graphics

R includes at least three graphical systems, the standard graphics package, the lattice package for Trellis graphs^[1] and the grammar-of-graphics ggplot2 package^[2]. R has good graphical capabilities but there are some alternatives like gnuplot.

Interactive Graphics

This section discusses some ways to draw graphics without using R scripts.

The playwith package provides a graphical user interface to customize graphs (add a title, a grid, some text, etc.) and it exports the R code you need if you want to replicate the analysis^[3]. If you want to know more, you can have a look at the screenshots on the website (link). See also the example on "R you Ready" [3]. This package requires the GTK+ libraries.

library("playwith")
playwith(plot(x1))

There is also a graphical user interface, GrapheR, which makes it very easy for beginners to draw graphs^[4]. This solution is cross-platform.

> library(GrapheR)

latticist (link) is another similar project.

Note also that some graphical user interfaces such as RKWard and R Commander make it easy to draw graphs.

Standard R graphs

In this section we present what you need to know if you want to customize your graphs in the default graph system.

plot() is the main function for graphics. The arguments can be a single point such as 0 or c(.3,.7), a single vector, a pair of vectors or many other R objects.
par() is another important function which defines the default settings for plots.
There are many other plot functions which are specific to some tasks such as hist(), boxplot(), etc. Most of them take the same arguments as the plot() function.

> N <- 10^2
> x1 <- rnorm(N) 
> x2 <- 1 + x1 + rnorm(N)
> plot(0) 
> plot(0,1) 
> plot(x1) 
> plot(x1,x2) # scatter plot x1 on the horizontal axis and x2 on the vertical axis
> plot(x2 ~ x1) # the same but using a formula (x2 as a function of x1)
> methods(plot) # show all the available methods for plot (depending on the number of loaded packages).

Titles, legends and annotations

Titles

main gives the main title and sub the subtitle. They can be passed as arguments to the plot() function or set afterwards with the title() function. xlab gives the label of the x axis and ylab the label of the y axis.

 plot(x1,x2, main = "Main title", sub = "sub title" , ylab = "Y axis", xlab = "X axis")
 plot(x1,x2 ,  ylab = "Y axis", xlab = "X axis")
 title(main = "Main title", sub = "sub title" )

The size of the text can be modified using the parameters cex.main, cex.lab, cex.sub and cex.axis. Those parameters define a scaling factor, i.e. the default text size is multiplied by the value of the parameter. If you choose cex.main=2, the main title will be twice as big as usual.
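
As an illustration, the scaling factors can be passed directly to plot(). The data are simulated and the values of the factors are arbitrary.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
x1 <- rnorm(100)
x2 <- 1 + x1 + rnorm(100)
plot(x1, x2, main = "Main title", sub = "sub title",
     xlab = "X axis", ylab = "Y axis",
     cex.main = 2,     # main title twice the default size
     cex.lab = 1.5,    # axis labels 1.5 times the default size
     cex.axis = 0.8,   # tick labels slightly smaller
     cex.sub = 0.8)    # subtitle slightly smaller
```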

Legend

legend() adds a legend to the current plot. The position can be "bottomleft", "bottomright", "topleft", "topright" or exact coordinates.

plot(x1, type = "l", col = 1, lty = 1) 
lines(x2, col = 2, lty = 2) 
legend("bottomleft", legend = c("x1","x2"), col = 1:2, lty = 1:2)

Text in the margin

mtext() puts some text in the margin. The margin can be at the bottom (1), the left (2), the top (3) or the right (4).

plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 1) # the bottom
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 2) # the left
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 3) # the top
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 4) # the right margin

Text in the graph

text() adds text at given coordinates inside the plot region.

Mathematical annotations

We can add mathematical symbols using expression() and substitute values into a formula using substitute().

?plotmath # gives help for mathematical annotations
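
A minimal sketch of both functions follows. The exponent b and the position of the label are chosen only for illustration.

```r
x <- seq(0, 2, by = 0.1)
b <- 2                                    # hypothetical exponent
plot(x, x^b, type = "l",
     main = expression(y == x^2),         # typeset as a formula
     xlab = expression(alpha),            # Greek letter
     ylab = expression(sqrt(beta)))       # square root sign
# substitute() replaces the symbol b by its value before typesetting
lab <- substitute(y == x^b, list(b = b))
text(0.5, 3, labels = lab)
```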

Types

The type of a plot can be :

n for none (nothing is printed),
p for points,
l for lines,
b for both,
o for both overplotted,
h for histogram-like
and s/S for steps.

x1 <- rnorm(50) 
png("plottype.png")
par(mfrow = c(2,2))
plot(x1, type = "p", main = "points", ylab = "", xlab = "")
plot(x1, type = "l", main = "lines", ylab = "", xlab = "")
plot(x1, type = "b", main = "both", ylab = "", xlab = "")
plot(x1, type = "o", main = "both overplot", ylab = "", xlab = "")
dev.off()

Axes

By default, the axes are drawn. We can remove them with axes=FALSE. We can also draw them ourselves using the axis() function.

> plot(x1,x2,axes=FALSE)
>
> plot(x1,x2,axes=FALSE)
> axis(1,col="red",col.axis="blue",font.axis=3)
> axis(2,col="red",col.axis="blue",font.axis=2,las=2)

las specifies the style of axis labels. It can be 0, 1, 2 or 3.

0 : always parallel to the axis [default],
1 : always horizontal,
2 : always perpendicular to the axis,
3 : always vertical.

x1 <- rnorm(100)
par(mfrow = c(2,2))
plot(x1, las = 0, main = "las = 0", sub = "always parallel to the axis", xlab = "", ylab = "")
plot(x1, las = 1, main = "las = 1", sub = "always horizontal", xlab = "", ylab = "") 
plot(x1, las = 2, main = "las = 2", sub = "always perpendicular to the axis", xlab = "", ylab = "")
plot(x1, las = 3, main = "las = 3", sub = "always vertical", xlab = "", ylab = "")

It is also possible to add another y axis on the right by calling axis(4).
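
One common way to use a right-hand axis is to overlay a second series that has its own scale. The sketch below is one possible approach under that assumption, with made-up data: it draws a second plot on top of the first with par(new = TRUE) and then adds the axis on side 4.

```r
x <- 1:10
y1 <- x^2                  # first series, left axis
y2 <- log(x)               # second series, right axis
op <- par(mar = c(5, 4, 4, 4) + 0.1)    # leave room on the right
plot(x, y1, type = "l", ylab = "x squared")
par(new = TRUE)                         # next plot is drawn over the current one
plot(x, y2, type = "l", lty = 2, axes = FALSE, xlab = "", ylab = "")
axis(4)                                 # axis on the right, on the scale of y2
mtext("log(x)", side = 4, line = 3)     # label for the right axis
par(op)                                 # restore the previous margins
```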

Margins

Margins can be specified in inches or in lines. The default is par(mar = c(5,4,4,2) + 0.1), which means that there are about 5 lines at the bottom, 4 lines on the left, 4 lines at the top and 2 lines on the right. This can be modified using the par() function. If you want to specify margins in inches, use par(mai = c(bottom, left, top, right)). If you want to specify margins in lines, use par(mar = c(bottom, left, top, right)). See ?par to learn more about the topic.

Colors

The color of the points or lines can be changed using the col argument, fg for foreground colors (boxes and axes) and bg for background colors.

show.col(object=NULL) (Hmisc) plots the main R colors with their numeric code.
The list of all colors in R (pdf)

colors() # list the r colors
show.col(object=NULL) # graphs the main R colors
plot(x1, col = "blue")
plot(x1, col = "red")
plot(x1, col = "red", col.axis = "dodgerblue", col.lab = "firebrick", col.main = "darkgreen", col.sub = "cyan4", main = "Testing colors", sub = "sub titles", ylab = "y axis", xlab = "x axis")

We can also generate new colors using the rgb() function. The first argument is the intensity of red, the second, the intensity of green and the third, the intensity of blue. They vary between 0 and 1 by default, but this can be modified with the option maxColorValue = 255 (which can be abbreviated to max = 255). col2rgb() returns the RGB code of R colors. col2hex() (gplots) gives the hexadecimal code. col2grey() and col2gray() (TeachingDemos) convert colors to grey scale.

> mycolor <- rgb(.2,.4,.6)
> plot(x1, col = mycolor)
> col2rgb("pink")
      [,1]
red    255
green  192
blue   203
> library("gplots")
> col2hex("pink")
[1] "#FFC0CB"
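
The hexadecimal code can also be obtained in base R by feeding the result of col2rgb() back into rgb(). A short sketch:

```r
rgb(0.2, 0.4, 0.6)                            # "#336699"
rgb(51, 102, 153, maxColorValue = 255)        # "#336699" (same color, 0-255 scale)
# From a color name to its hexadecimal code without extra packages
v <- col2rgb("pink")
rgb(v[1], v[2], v[3], maxColorValue = 255)    # "#FFC0CB"
# A fourth argument sets the opacity (alpha); 0 is fully transparent
semi <- rgb(1, 0, 0, alpha = 0.5)
```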

Points

For points, the symbol can be changed using the pch option, which takes integer values between 0 and 25 or a single character. pch can also take a vector as argument. In that case the first point will use the first element of the vector as symbol, and so on.

plot(x1, type = "p", pch = 0)
plot(x1, type = "p", pch = 10)
plot(x1, type = "p", pch = 25)
plot(x1, type = "p", pch = "a")
plot(x1, type = "p", pch = "*")
plot(x1[1:26], type = "p", pch = 0:25)
plot(x1[1:26], type = "p", pch = letters)

The following code displays all the symbols on the same plot :

x <- rep(1,25)
plot(x, pch = 1:25, axes = F, xlab = "", ylab = "")
text(1:25,.95,labels = 1:25)

points() adds points to an existing plot.

> plot(x1, pch = 0) # plot x1 
> points(x2, pch = 1, col = "red") # add x2 to the existing plot

Lines

We can change the line type with lty. The argument is a string ("blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash") or an integer (0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash). The line width can be changed with lwd. The default is lwd=1. lwd=2 means that the width is twice the normal width.

plot(x1, type = "l", lty = "blank")
plot(x1, type = "l", lty = "solid")
plot(x1, type = "l", lty = "dashed")
plot(x1, type = "l", lty = "dotted")
plot(x1, type = "l", lty = "dotdash")
plot(x1, type = "l", lty = "longdash")
plot(x1, type = "l", lty = "twodash")

lines() adds lines to an existing graph.

plot(x1, type = "l", lty = "solid")
lines(x2, type = "l", lty = "dashed", col = "red")

abline() adds a horizontal line (h=), a vertical line (v=) or a straight line given by its intercept (a=) and slope (b=) to the current plot. abline() can also draw the regression line of a fitted linear model.

> plot(x1, type = "l", lty = "solid")
> abline(h= -3, lty = "dashed", col = "gray")
> abline(v = 0, lty = "dashed", col = "gray")
> abline(a = -3 , b = .06, lty = "dotted", col = "red")

Boxes

Each graph is framed by a box. bty specifies the box type.

plot(x1, bty = "o") # the default
plot(x1, bty = "n") # no box
plot(x1, bty = "l")
plot(x1, bty = "7")
plot(x1, bty = "u")
plot(x1, bty = "c")
plot(x1, bty = "]")

See also box() to add a box to an existing plot.

Grid

grid() adds a grid to the current graph.

> plot(x1)
> grid()

Although grid has an optional argument nx for setting the number of grid lines, it is not possible to tell it explicitly where to place those lines (it will usually not place them at integer values). A more precise and manageable alternative is to use abline().

> abline(v=(seq(0,100,5)), col="lightgray", lty="dotted")
> abline(h=(seq(0,100,5)), col="lightgray", lty="dotted")

Arrows and segments

Polygons

Other figures

We can also add a circle to a plot with the circle() function in the calibrate package.

Background

You can choose the background of your plot. For instance, you can change the background color with par(bg=).

par(bg="whitesmoke")
par(bg="transparent")

Overlaying plots

matplot() plots the columns of one matrix against the columns of another, so that several series appear on the same plot.

N <- 100
x1 <- rnorm(N)
x2 <- rnorm(N) + x1 + 1
y <- 1 + x1 + x2 + rnorm(N)
mydat <- data.frame(y,x1,x2)
matplot(mydat[,1],mydat[,2:3], pch = 1:2)

Multiple plots

With par() we can display multiple figures on the same plot. mfrow = c(3,2) prints 6 figures on the same plot with 3 rows and 2 columns, filling the layout row by row. mfcol = c(3,2) does the same but fills the layout column by column.

par(mfrow = c(3,2))
plot(x1, type = "n")
plot(x1, type = "p")
plot(x1, type = "l")
plot(x1, type = "h")
plot(x1, type = "s")
plot(x1, type = "S")

par(mfcol = c(3,2))
plot(x1, type = "n")
plot(x1, type = "p")
plot(x1, type = "l")
plot(x1, type = "h")
plot(x1, type = "s")
plot(x1, type = "S")

Plotting a function

curve() plots a function. This can be added to an existing plot with the option add = TRUE.
plot() can also plot functions.

curve(x^2, from = -1 , to = 1, main = "Quadratic function", ylab = "f(x)=x^2")

plot(rnorm(100))
curve((x/100)^2, add = TRUE, col = "red")
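
Both plot() and curve() accept a function name instead of an expression in x. The sketch below also shows that curve() invisibly returns the points it evaluated; the number of points n is chosen arbitrarily.

```r
# plot() applied to a function draws it between from and to
plot(sqrt, from = 0, to = 4, main = "Square root function")
# curve() also accepts the name of a function
curve(sin, from = -pi, to = pi)
# The result of curve() is a list with the x and y values used for the drawing
res <- curve(x^2, from = -1, to = 1, n = 11)
length(res$x)    # 11
```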

square root function, made using plot()

Exporting graphs

How can you export a graph ?

First, you can plot the graph and use the context menu (right click on Windows and Linux or control + click on Mac) to copy or save the graph. The available options depend on your operating system. On Windows, you can also copy the current graph to the clipboard as a bitmap (raster graphics) using CTRL + C or as a Windows Metafile (vector graphics) using CTRL + W. You can then paste it into another application.
You can export a plot to pdf, png, jpeg, bmp or tiff by adding pdf("filename.pdf"), png("filename.png"), jpeg("filename.jpg"), bmp("filename.bmp") or tiff("filename.tiff") prior to the plotting, and dev.off() after the plotting.
You can also use the savePlot() function to save existing graphs.
Sweave also produces ps and pdf graphics (see the Sweave section).

It is better to use vector devices such as pdf, ps or svg.

How can you know the list of all available devices ?

See ?Devices for the list of graphics devices.
Use the capabilities() function to see which devices are available on your computer.

?Devices
> capabilities()
    jpeg      png     tiff    tcltk      X11     aqua http/ftp  sockets 
    TRUE     TRUE     TRUE     TRUE    FALSE    FALSE     TRUE     TRUE 
  libxml     fifo   cledit    iconv      NLS  profmem    cairo 
    TRUE    FALSE     TRUE     TRUE     TRUE     TRUE    FALSE

png("r_plot.png", width = 420, height = 340)
plot(x1, main = " Example")
dev.off()

pdf("r_plot.pdf", width = 6, height = 5) # for pdf(), width and height are in inches, not pixels
plot(x1, main = " Example")
dev.off()

postscript(file="graph1.ps",horizontal=F,pagecentre=F,paper="special",width=8.33,height=5.56) 
plot(x1, main = "Example")
dev.off()

plot(x1, main = "Example")
savePlot("W:/Bureau/plot.pdf", type = "pdf")
savePlot("W:/Bureau/plot.png", type = "png")

We can also export to SVG using the svg() function.

svg("scatterplot.svg", width = 7, height = 7)
plot(x, y)
dev.off()
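
The general pattern is always the same: open a device, draw, then close the device. The sketch below writes to a temporary directory and checks that the file was created; the file name is arbitrary.

```r
out <- file.path(tempdir(), "r_plot.pdf")   # hypothetical output location
pdf(out, width = 7, height = 5)             # open the device (size in inches)
plot(rnorm(50), main = "Example")           # draw on the device
invisible(dev.off())                        # close it, which writes the file
file.exists(out)                            # TRUE
```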

The RSvgDevice package, which was used with earlier versions of R, now seems outdated.

Advanced topics

Animated plots

The animation package provides dynamic graphics capabilities. It is possible to export the animation in flash, mpeg or gif format. There are more examples on the aniwiki website: http://animation.yihui.name/.

You can also create motion charts using the googleVis package^[5].

Interactive Graphics

The iplots package provides a way to have interactive data visualization in R^[6] ·^[7].

R GUI now offers interactive graphics – Deducer 0.4-2 connects with iplots

To create an interactive, animated plot viewable in a web browser, the animint package can be used. The main idea is to define an interactive animation as a list of ggplots with two new aesthetics:

showSelected=variable means that only the subset of the data that corresponds to the selected value of variable will be shown.
clickSelects=variable means that clicking a plot element will change the currently selected value of variable.

Graphics gallery

In this section, we review all kinds of statistical plots and the alternatives available to draw them using R. This includes code for the standard graphics package, the lattice package and the ggplot2 package. Also, we add some examples from the Commons repository. We only add examples which are provided with the R code.

Wikimedia Commons has media related to: Category:Created with R

Line plot

To draw a line plot, use the generic plot() function by setting type="l".

> x <- seq(0, 2*pi, pi/10)
> plot(x, sin(x), type="l")

Then, you can add further lines on the same plot using the lines() function.

> lines(x, cos(x))

Scatter plot

Wikipedia has related information at Scatter plot

plot(x,y)
plot(y ~ x)
xyplot(y ~ x) (lattice)
qplot(x,y) (ggplot2)

Log scale

Sometimes it is useful to plot the log of a variable and to have a log scale on the axis. It is possible to plot the log of a variable using the log option in the plot() function.

For a log log plot, use log = "xy"
For a log in the x axis only, use log = "x"
For a log in the y axis only, use log = "y"

plot(x, y , log = "xy")
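As a minimal sketch of the log option, we can simulate strictly positive data (the variable names and the simulated relationship are illustrative assumptions) and compare the different axis settings. Both variables must be strictly positive to be drawn on a log scale.

```r
# Simulated, strictly positive data: y is roughly a power function of x
set.seed(1)
x <- exp(runif(50, 0, 5))
y <- 2 * x^1.5 * exp(rnorm(50, sd = 0.2))

plot(x, y)               # linear axes: the points bunch up near the origin
plot(x, y, log = "xy")   # log-log axes: the relationship looks linear
plot(x, y, log = "x")    # log scale on the x axis only
plot(x, y, log = "y")    # log scale on the y axis only
```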

Label points in a plot

It is possible to add labels with the text() function.
textxy() (calibrate) makes it easy to add labels.

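library(calibrate) # provides textxy()
N <- 10
u <- rnorm(N)
x <- 1 + rnorm(N)
y <- 1 + x + u
plot(x, y)
textxy(x, y, labs = signif(x, 3), cex = 0.7) # the size argument was named cx in older versions of calibrate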
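The same kind of labelling can be sketched with the base text() function alone, without any extra package. The data are simulated and the label choice (x rounded to three significant digits) is only an illustration.

```r
set.seed(1)
N <- 10
x <- 1 + rnorm(N)
y <- 1 + x + rnorm(N)
labs <- as.character(signif(x, 3))  # labels: x with 3 significant digits

plot(x, y)
# pos = 3 places each label just above its point; cex shrinks the text
text(x, y, labels = labs, pos = 3, cex = 0.7)
```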

Histogram

Wikipedia has related information at histogram

hist()
histogram() (lattice)

You can learn more about histograms in the Non parametric methods page.

Box plot

Box plot :

boxplot()

Bar charts

See Bar charts on wikipedia.

barplot() takes a table as argument and returns a bar chart.
qplot() (ggplot2) with the option geom = "bar" takes a variable as argument and returns a bar chart^[8].
barchart() (lattice) takes a variable as argument and returns a bar chart.
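A minimal sketch with the base barplot() function: the categorical variable is first tabulated with table(), and the resulting table is passed to barplot(). The simulated variable is an assumption made for the example.

```r
set.seed(1)
x <- sample(c("A", "B", "C"), 100, replace = TRUE)
tab <- table(x)   # frequency table of the categories

# barplot() draws the chart and returns the horizontal midpoint of each bar
mid <- barplot(tab, main = "Frequencies", ylab = "Count")
```

The returned midpoints can be used afterwards to add text or error bars on top of the bars.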

Dot plot

dotchart()
dotplot() (lattice)

Pie charts

Wikipedia has related information at Pie chart

pie()

Treemap

The tmPlot() function in the treemap package makes it easy to draw a treemap. In more recent versions of the package, this function is called treemap().

Confidence interval plot

Standard error bar charts are very useful to plot several estimates with their confidence intervals.

The Hmisc package has an errbar() function. This function takes the upper and lower bounds of the confidence intervals as argument^[9].

The coefplot() function in Gelman and Hill's arm package is designed to display estimation results. It takes point estimates and standard errors as arguments.

coefs <- c(0.2, 1.4, 2.3, 0.5, 0.3) # vector of point estimates
se <- c(0.12, 0.24, 0.23, 0.15, 0.2) # standard errors of point estimates
variable <- 1:5 # variable names
library("arm")
# we use CI = qnorm(.975) to have 95% confidence intervals
coefplot(coefs, se, varnames = variable, vertical = TRUE, CI = qnorm(.975))
coefplot(coefs, se, varnames = variable, vertical = FALSE, CI = qnorm(.975))
library("Hmisc")
# yplus is the upper bound and yminus the lower bound of each interval
errbar(variable, coefs, yplus = coefs + qnorm(.975) * se, yminus = coefs - qnorm(.975) * se)

3D plots

contour(), image(), persp()
plot3d() (rgl)
wireframe() (lattice)
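The three base functions expect a grid of x values, a grid of y values and a matrix z of heights. A minimal sketch, assuming a bell-shaped surface chosen only for illustration, builds z with outer():

```r
# Evaluate a function of two variables on a 40 x 40 grid
x <- seq(-3, 3, length.out = 40)
y <- seq(-3, 3, length.out = 40)
z <- outer(x, y, function(x, y) exp(-(x^2 + y^2) / 2))

contour(x, y, z)                      # contour lines
image(x, y, z)                        # colour map
persp(x, y, z, theta = 30, phi = 30)  # perspective (3D surface) plot
```

theta and phi set the viewing angles of the perspective plot.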

Diagrams

grid package by Paul Murrell^[10]
diagram package ^[11]
Rgraphviz package
igraph package

Arc Diagrams

It is also possible to draw Arc Diagrams^[12].

Dendrograms

It is possible to plot dendrograms in R^[13].

Treemap

It is possible to draw a treemap using the treemap() function in the treemap package^[14].

Wordcloud

There are:

the wordcloud() function in the wordcloud package
the tagcloud() function in the tagcloud package

Timeline

timeline() in the timeline package

Maps

Resources

Tables 2 Graphs
R Graphics by Paul Murrell^[15]
ggplot2 ^[16]
Graphical Parameters [4]

References

↑ D. Sarkar. Lattice: Multivariate Data Visualization with R. Springer, 2008. ISBN 9780387759685.
↑ ggplot2: Elegant Graphics for Data Analysis (Use R) by Hadley Wickham and a list of examples on his own website : http://had.co.nz/ggplot2/
↑ playwith : http://code.google.com/p/playwith/
↑ Hervé, Maxime (2011). "GrapheR: a Multiplatform GUI for Drawing Customizable Graphs in R" (PDF). The R Journal. 3 (2).
↑ Tutorial for the googleVis package : http://stackoverflow.com/questions/4646779/embedding-googlevis-charts-into-a-web-site/4649753#4649753
↑ http://www.r-bloggers.com/interactive-graphics-with-the-iplots-package-from-%E2%80%9Cr-in-action%E2%80%9D/
↑ Interactive Graphics with the iplots Package, a chapter from the R in Action book: http://www.r-statistics.com/2012/01/interactive-graphics-with-the-iplots-package-from-r-in-action/
↑ Hadley Wickham ggplot2: Elegant Graphics for Data Analysis, Springer Verlag, 2009
↑ The default output in errbar() changed between R version 2.8.1 and R version 2.9.2. Axis are not displayed by default anymore
↑ Paul Murrell Drawing Diagrams with R, The R Journal, 2009 http://journal.r-project.org/2009-1/RJournal_2009-1_Murrell.pdf
↑ (example: Using a binary tree diagram for describing a Bernoulli process)
↑ Gaston Sanchez (February 3, 2013). "Arc Diagrams in R: Les Miserables". Retrieved February 5, 2013.
↑ Gaston Sanchez (October 3, 2012). "7+ ways to plot dendrograms in R". Retrieved February 5, 2013.
↑ http://cran.r-project.org/web/packages/treemap/treemap.pdf
↑ http://www.stat.auckland.ac.nz/~paul/RGraphics/rgraphics.html
↑ http://had.co.nz/ggplot2/

Grammar of graphics

Hadley Wickham has developed ggplot2, a graphics library designed according to the principles of the Grammar of Graphics.

Plotting a function

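We use qplot() with the option stat = "function":

library(ggplot2)
# Plot the quadratic function
square <- function(x){
  x^2
}
mode(square) # "function"
# note: this relies on the stat argument of qplot(), as in the ggplot2 versions this example was written for
qplot(c(0, 2), stat = "function", fun = square, geom = "line")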

Here is another example with the sine function:

# plot the sine function
qplot(c(-10, 10), stat = "function", fun = sin, geom = "line")
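In recent versions of ggplot2 the stat argument of qplot() is deprecated, so the qplot() calls above may no longer work as written. A minimal sketch of the same two plots with ggplot() and stat_function(), assuming ggplot2 is installed:

```r
library(ggplot2)

square <- function(x) x^2

# Plot the quadratic function on [0, 2]
p1 <- ggplot(data.frame(x = c(0, 2)), aes(x = x)) +
  stat_function(fun = square)
print(p1)

# Plot the sine function on [-10, 10]
p2 <- ggplot(data.frame(x = c(-10, 10)), aes(x = x)) +
  stat_function(fun = sin)
print(p2)
```

The data frame only supplies the range of the x axis; stat_function() evaluates the function on a grid inside that range.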

Bibliography

Leland Wilkinson, The Grammar of Graphics (Statistics and Computing), Springer, 2005
Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis, Use R!, Springer, 2009

Resources

Documentation page for ggplot2

Publication quality output

Formatting numbers

You can use the format() function to control the number of digits and other characteristics of a displayed object.

> df <- data.frame(x = rnorm(10), y = rnorm(10))
> print(df)
            x          y
1  -0.4350953 -0.6426477
2  -0.5947293 -0.2389625
3  -0.7061850 -2.4382016
4  -0.3384038 -0.6322842
5   0.2713353  0.5396409
6  -1.1144711 -2.0321274
7  -1.0356184  1.7217443
8  -2.6665278 -0.3621377
9   0.2975570  0.1598905
10  1.4631458 -0.7995652
> print(format(df, digits=3, scientific=T))
           x         y
1  -4.35e-01 -6.43e-01
2  -5.95e-01 -2.39e-01
3  -7.06e-01 -2.44e+00
4  -3.38e-01 -6.32e-01
5   2.71e-01  5.40e-01
6  -1.11e+00 -2.03e+00
7  -1.04e+00  1.72e+00
8  -2.67e+00 -3.62e-01
9   2.98e-01  1.60e-01
10  1.46e+00 -8.00e-01
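format() also works on single numbers, and base R offers formatC(), sprintf() and round() as alternatives. The short sketch below compares them on one value; the variable names are only illustrative.

```r
x <- 3.14159265

a <- format(x, digits = 3)                 # "3.14": three significant digits
b <- formatC(x, format = "f", digits = 2)  # "3.14": two decimals, fixed notation
d <- formatC(x, format = "e", digits = 2)  # "3.14e+00": scientific notation
s <- sprintf("%.3f", x)                    # "3.142": C-style format string
r <- round(x, 2)                           # 3.14, still a number
big <- format(1234567, big.mark = ",")     # "1,234,567": thousands separator
print(c(a, b, d, s, big))
```

Note that format(), formatC() and sprintf() return character strings, whereas round() returns a number that can still be used in computations.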

Sweave

Wikipedia has related information at Sweave

Sweave^[1] is a literate programming language which integrates LaTeX and R code. The Sweave file generates a LaTeX file and an R file which can in turn be compiled. Roger Koenker^[2], Meredith and Racine (2009)^[3] and Charles Geyer^[4] argue that Sweave favors reproducible econometric/statistical research.

There are some alternatives to Sweave for literate programming. One of them is Babel, which is included in Emacs Orgmode^[5]. This tool allows export to LaTeX and HTML. It is also possible to include code chunks for various programming languages (R, Ruby, etc).

Syntax

The main idea is that you write a file which includes LaTeX and R code. LaTeX code begins with @ and R code with <<>>= (some options can be included between << and >>).

@
% Some LaTeX code
\section{Results}
I show that ...
<<>>=
# Some R code
qnorm(.975)
@
% Some LaTeX code
$$
\Phi^{-1}(.975) = 1.96 
$$

The file is stored with extension .Rnw or .rnw. At the end, you extract from this file an R file using Stangle() and a LaTeX file using Sweave(). Here is an example with a file called file.Rnw which generates file.tex and file.R

> Sweave("file.Rnw")
Writing to file file.tex
Processing code chunks ...
 1 : echo keep.source term verbatim pdf
 2 : echo keep.source term verbatim pdf
> Stangle("file.Rnw")
Writing to file file.R

Then you can run LaTeX on your file.tex. This can be done using the system() function or texi2dvi().

# Example under Windows :
system("pdflatex.exe -shell-escape file.tex") # runs pdflatex
shell.exec("file.pdf") # opens the pdf with the default viewer (Windows only)

Note that you may need to download Sweave.sty from the internet since it is not part of the standard MiKTeX distribution.

You can also add your results in your text using the \Sexpr{} function.

$
\Phi^{-1}(.975) = \Sexpr{qnorm(.975)} 
$

Options

There are some options. These options can be included for each code chunk or in the Sweave command.

For figures, you can either include them in the tex file using fig=T or not include them using fig=F.

By default, figures are exported as pdf and eps files. If you only want one format, suppress the other one with the pdf=F or eps=F option.

The R code can be displayed in the tex file using echo=T. If you don't want to include it in the tex file, use echo=F.
The R code can be evaluated using eval=T. If you don't want to evaluate the R code, use eval=F.
The results :
- results=tex treats the output as LaTeX code
- results=verbatim treats the output as Verbatim (the default)
- results=hide does not include the results in the LaTeX output

These options can be passed to the Sweave() function.

Sweave("file.Rnw", pdf = T, eps=F, echo = F, results = "verbatim")

They can also be passed to each code chunk.

<<fig=T,pdf=T,eps=F>>=
plot(rnorm(100), col = "red")
@

Text editor for Sweave

The main issue with Sweave is that few text editors include syntax highlighting for Sweave. Here are some exceptions :

RStudio is a very good solution. It is easy to install and use and it includes buttons to run Sweave files.
Vim provides syntax highlighting for Sweave files (R noweb syntax).
Emacs + ESS (Emacs Speaks Statistics) provides full support for Sweave file. It includes a keyboard shortcut to run Sweave files and syntax highlighting switching between LaTeX and R.
Eclipse StatET plugin provides support for Sweave (LaTeX/R) documents with all basic features (syntax highlighting, bracket matching, toggle comment, ...) and with detection of R chunks.

Pubprint

Pubprint is a small utility that transforms the output of statistical tests into publication-ready output. Pubprint can export to several formats (HTML, LaTeX, Markdown and plain text), but unfortunately supports only the APA style (the publication style of the American Psychological Association). However, this style is widely used and may be appropriate in many cases.

Example

> library("pubprint")
> pprint(t.test(rnorm(30), rnorm(30)))
[1] "(\\ensuremath{M\\ifmmode_{x}\\else\\textsubscript{x}\\fi=-0.05,M\\ifmmode_{y}\\else\\textsubscript{y}\\fi=0.09,t[57.74]=-0.49,p=.628})"

By default, pubprint prints a LaTeX-formatted string, but the output format can be changed (according to the manual, pubprint is intended to be used with knitr and detects the output format automatically when used with it):

> pp_opts_out$set(pp_init_out("plain"))
> pprint(t.test(rnorm(30), rnorm(30)))
[1] "(M_x=-0.14,M_y=-0.24,t[57.4]=0.41,p=.682)"
> pprint(cor.test(rnorm(30), rnorm(30)))
[1] "(r=-.08,p=.693)"

The output can be pasted into a document or may be included in a knitr/Sweave \Sexpr{} statement.

Export to LaTeX

R has lots of functions which allow it to export results to LaTeX^[9].

General functions

toLatex() in the utils package.

Note that toLatex() does not handle matrices.
toLatex() has been adapted to handle matrices and ftables in the memisc package.

> toLatex(sessionInfo())
\begin{itemize}
  \item R version 2.2.0, 2005-10-06, \verb|powerpc-apple-darwin7.9.0|
  \item Base packages: base, datasets, grDevices,
    graphics, methods, stats, utils
\end{itemize}

mat2tex() (sfsmisc) exports a matrix to LaTeX.
tex.table() (cwhmisc) exports a data frame to a LaTeX table.

> tex.table(mydat)
\begin{table}[ht]
\begin{center}
\begin{footnotesize}
\begin{tabular}{r|rrr}
\hline
 & y & x1 & x2\\ \hline
1 & -0.09 & -0.37 & -1.04\\ 
2 & 0.31 & 0.19 & -0.09\\ 
3 & 3.78 & 0.58 & 0.62\\ 
4 & 2.09 & 1.40 & -0.95\\ 
5 & -0.18 & -0.73 & -0.54\\ 
6 & 3.16 & 1.30 & 0.58\\ 
7 & 2.78 & 0.34 & 0.77\\ 
8 & 2.59 & 1.04 & 0.46\\ 
9 & -1.96 & 0.92 & -0.89\\ 
10 & 0.91 & 0.72 & -1.1\\ 
\hline
\end{tabular}
\end{footnotesize}
\end{center}
\end{table}

xtable() (xtable) exports various objects, including tables, data frames, lm, aov, and anova, to LaTeX.

> # lm example
> library(xtable)
> x <- rnorm(100)
> y <- 2*x + rnorm(100)
> lin <- lm(y~x)
> xtable(lin)
% latex table generated in R 2.15.1 by xtable 1.7-0 package
% Sun Sep 23 21:54:04 2012
\begin{table}[ht]
\begin{center}
\begin{tabular}{rrrrr}
  \hline
 & Estimate & Std. Error & t value & Pr($>$$|$t$|$) \\ 
  \hline
(Intercept) & -0.0407 & 0.0984 & -0.41 & 0.6803 \\ 
  x & 2.0466 & 0.1043 & 19.63 & 0.0000 \\ 
   \hline
\end{tabular}
\end{center}
\end{table}

> # table example
> x <- sample(1:10, 30, replace = T)
> tab <- table(x)
> tab <- cbind(tab, prop.table(tab))
> colnames(tab) <- c("N.", "Prop.")
> xtable(tab, digits = c(0, 0, 2))
% latex table generated in R 2.15.1 by xtable 1.7-0 package
% Sun Sep 23 22:06:36 2012
\begin{table}[ht]
\begin{center}
\begin{tabular}{rrr}
  \hline
 & N. & Prop. \\ 
  \hline
1 & 5 & 0.17 \\ 
  3 & 1 & 0.03 \\ 
  4 & 3 & 0.10 \\ 
  5 & 6 & 0.20 \\ 
  6 & 5 & 0.17 \\ 
  7 & 3 & 0.10 \\ 
  8 & 2 & 0.07 \\ 
  9 & 2 & 0.07 \\ 
  10 & 3 & 0.10 \\ 
   \hline
\end{tabular}
\end{center}
\end{table}

Descriptive statistics

estout package.
The reporttools package includes some functions for tables of descriptive statistics^[10].

Estimation results

The stargazer package provides an easy way to export the results of regressions to LaTeX^[11].
texreg provides the same kind of features^[12].
The estout package provides functions similar to Stata's esttab and estout utilities^[13]. Estimates are stored using eststo() and printed using esttab(). They can be exported to CSV and LaTeX. These functions support lm, glm and plm objects (see the plm package).
apsrtable() (apsrtable) exports the results of multiple regressions to LaTeX in a way similar to the American Political Science Review publication standard.
xtable() (xtable package) exports data frames, matrices and estimation results^[14]. xtable() can also be used to export the results to an HTML file.
The outreg() function^[15] developed by Paul Johnson is similar to the Stata outreg^[16] function. See the "R you ready ?" post on this topic.
mtable() and toLatex() in the memisc package.

N <- 10^3
u <- rnorm(N)
x1 <- rnorm(N)
x2 <- x1 + rnorm(N)
y <- 1 + x1 + x2 + u
lm1 <- lm(y ~ x1 + x2 )
lm2 <- lm(y ~ x1 + x2 + I(x1*x2))

library(estout)
estclear() # clear all the eststo objects
eststo(lm1) 
eststo(lm2)
esttab() # print it

library("apsrtable")
apsrtable(lm1,lm2)

library(xtable)
xtable(lm1)
tab <- xtable(lm1)
print(tab,type="html")

source("http://pj.freefaculty.org/R/WorkingExamples/outreg-worked.R")
outreg(list(lm1,lm2))

library("memisc")
toLatex(mtable(lm1,lm2))

Export to HTML

rpublisher^[17] is a literate programming tool which publishes results in HTML (it is based on Python and was last updated in 2008).

See R2HTML, xtable, hwriter, prettyR, highlight, HTMLUtils

wiki.table() in the hacks package exports a matrix or a data frame to MediaWiki table markup (as used on this wiki and many others).

> wiki.table(matrix(1:16,4),caption="Test")
{|  
|+ Test 
| 1 || 5 || 9 || 13 
|-
| 2 || 6 || 10 || 14 
|-
| 3 || 7 || 11 || 15 
|-
| 4 || 8 || 12 || 16 
|}

References

↑ The Sweave Homepage http://www.stat.uni-muenchen.de/~leisch/Sweave/
↑ http://www.econ.uiuc.edu/~roger/repro.html
↑ Meredith, E. and J.S. Racine (2009), “Towards Reproducible Econometric Research: The Sweave Framework,” Journal of Applied Econometrics, Volume 24, pp 366-374.
↑ Charles Geyer "Why Reproducible Research is the Right Thing" http://www.stat.umn.edu/~charlie/Sweave/
↑ Babel in Emacs Orgmode http://orgmode.org/worg/org-contrib/babel/intro.html
↑ Ista Zahn, Learning to Sweave in APA Style, The PracTeX Journal 2008, 1
↑ decumar git archive : http://github.com/hadley/decumar
↑ wikirobot http://r-forge.r-project.org/projects/wikirobot/
↑ See the LaTeX Wikibook if you want to learn about LaTeX
↑ reporttools: R Functions to Generate LaTeX Tables of Descriptive Statistics
↑ http://www.r-statistics.com/2013/01/stargazer-package-for-beautiful-latex-tables-from-r-statistical-models-output/
↑ http://www.r-bloggers.com/texreg-a-package-for-beautiful-and-easily-customizable-latex-regression-tables-from-r/
↑ estout : http://repec.org/bocode/e/estout/
↑ xtable on dataninja blog
↑ The outreg() function http://pj.freefaculty.org/R/WorkingExamples/outreg-worked.R
↑ Stata outreg http://ideas.repec.org/c/boc/bocode/s375201.html
↑ rpublisher : http://code.google.com/p/rpublisher/

Descriptive Statistics

In this section, we present descriptive statistics, i.e., a set of tools to describe and explore data. This mainly includes univariate and bivariate statistical tools.

Generic Functions

We introduce some functions to describe a dataset.

names() gives the names of each variable
str() gives the structure of the dataset
summary() gives the mean, median, min, max, 1st and 3rd quartile of each variable in the data.

> summary(mydat)

describe() (Hmisc package) gives more details than summary()

> library("Hmisc")
> describe(mydat)

contents() (Hmisc package)
dims() in the Zelig package.
descr() in the descr package gives min, max, mean and quartiles for continuous variables, frequency tables for factors and length for character vectors.
whatis() (YaleToolkit) gives a good description of a dataset.
detail() in the SciencesPo package gives a broad range of statistics for continuous variables, frequency tables for factors and length for character vectors.
describe() in the psych package also provides summary statistics:

> x = runif(100)
> y = rnorm(100)
> z = rt(100,1)
> sample.data = x*y*z
> require(psych)
Loading required package: psych
> describe(cbind(sample.data,x,z,y))
            var   n  mean   sd median trimmed  mad    min   max range  skew kurtosis   se
sample.data   1 100  0.37 3.21   0.00    0.07 0.31  -9.02 24.84 33.86  4.79    36.91 0.32
x             2 100  0.54 0.28   0.56    0.55 0.35   0.02  1.00  0.98 -0.12    -1.13 0.03
z             3 100  0.12 6.28   0.02   -0.01 1.14 -30.40 37.93 68.33  1.49    22.33 0.63
y             4 100 -0.01 1.07   0.09   -0.02 1.12  -2.81  2.35  5.16  0.00    -0.30 0.11
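The base functions listed at the beginning of this section need no extra package. As a minimal sketch, we can try them on a small simulated data frame (the data frame mydat and its columns are assumptions made for the example):

```r
set.seed(1)
mydat <- data.frame(y  = rnorm(10),
                    x1 = rnorm(10),
                    g  = factor(sample(c("a", "b"), 10, replace = TRUE)))

names(mydat)     # variable names
str(mydat)       # structure: type of each variable and first values
summary(mydat)   # min, quartiles, median, mean, max (counts for factors)
dim(mydat)       # number of rows and columns
```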

Univariate analysis

Continuous variable

Moments

the mean : mean().
the variance : var().
the standard deviation : sd().
the skewness : skewness() (fUtilities, moments or e1071)
the kurtosis : kurtosis() (fUtilities, moments or e1071)
all the moments : moment() (moments) and all.moments() (moments).

> library(moments)
>  x <- rnorm(1000)
> moment(x,order = 2) # the second raw moment (close to the variance here, since the mean is close to 0)
[1] 0.999782
> all.moments(x, order.max = 4) # raw moments of order 0 to 4
[1] 1.000000000 0.006935727 0.999781992 0.062650605 2.972802009
> library("e1071")
> moment(x,order = 3) # the third raw moment (close to the skewness here)
[1] 0.0626506
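To make the link between moments, skewness and kurtosis explicit, here is a sketch that computes them by hand in base R. It uses the moment-based definitions with denominator n, which is one common convention; packages may apply different small-sample corrections.

```r
set.seed(1)
x <- rnorm(1000)
n <- length(x)

m <- mean(x)
v <- var(x)                  # sample variance (denominator n - 1)
s <- sd(x)                   # standard deviation, sqrt(var(x))

# Central moments of order 2, 3 and 4 (denominator n)
m2 <- mean((x - m)^2)
m3 <- mean((x - m)^3)
m4 <- mean((x - m)^4)

skew <- m3 / m2^(3/2)        # skewness, 0 for a symmetric distribution
kurt <- m4 / m2^2            # kurtosis, 3 for a normal distribution
c(mean = m, var = v, sd = s, skewness = skew, kurtosis = kurt)
```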

Order statistics

the range, the minimum and the maximum : range() returns the range of a vector (minimum and maximum of a vector), min() the minimum and max() the maximum.
IQR() computes the interquartile range. median() computes the median and mad() the median absolute deviation.
quantile(), hdquantile() in the Hmisc package and kuantile() in the quantreg package compute the sample quantiles of a continuous vector. kuantile() may be more efficient when the sample size is large.

> library(Hmisc)
> library(quantreg)
> x <- rnorm(1000)
> probs <- seq(0, 1, 0.25) # a name other than seq avoids confusion with the seq() function
> quantile(x, probs = probs, na.rm = FALSE, names = TRUE)
         0%         25%         50%         75%        100% 
-3.07328999 -0.66800917  0.02010969  0.72620061  2.92897970 
> hdquantile(x, probs = probs, se = FALSE, na.rm = FALSE, names = TRUE, weights=FALSE)
       0.00        0.25        0.50        0.75        1.00 
-3.07328999 -0.66901899  0.02157989  0.72378407  2.92897970 
> kuantile(x, probs = seq(0, 1, .25), na.rm = FALSE, names = TRUE)
         0%         25%         50%         75%        100% 
-3.07328999 -0.66800917  0.02010969  0.72620061  2.92897970 
attr(,"class")
[1] "kuantile"
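The relations between these order statistics can be checked with base R alone. The following sketch uses a simulated sample; the default quantile type of quantile() is assumed throughout.

```r
set.seed(1)
x <- rnorm(1000)

q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q
range(x)          # minimum and maximum
diff(range(x))    # width of the range
IQR(x)            # interquartile range: third quartile minus first quartile
median(x)         # same as the 50% quantile
mad(x)            # median absolute deviation, scaled by 1.4826 by default
```

The scaling constant 1.4826 makes mad() a consistent estimator of the standard deviation for normal data.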

Inequality Index

The Gini coefficient : Gini() (ineq) and gini() (reldist).
ineq() (ineq) gives all the inequality indices.

> library(ineq)
> x <- rlnorm(1000)
> Gini(x)
[1] 0.5330694
> RS(x) #  Ricci-Schutz coefficient
[1] 0.3935813
> Atkinson(x, parameter = 0.5)
[1] 0.2336169
> Theil(x, parameter = 0)
[1] 0.537657
> Kolm(x, parameter = 1)
[1] 0.7216194
> var.coeff(x, square = FALSE)
[1] 1.446085
> entropy(x, parameter = 0.5)
[1] 0.4982675
> library("reldist")
> gini(x)
[1] 0.5330694
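To see what the Gini coefficient measures, it can be sketched by hand from the ordered sample. The helper gini_by_hand() below is hypothetical (it is not part of ineq or reldist) and uses the uncorrected formula, so it should agree with the package functions only up to any finite-sample correction they apply.

```r
# Hand-rolled Gini coefficient (hypothetical helper, not part of any package)
gini_by_hand <- function(x) {
  x <- sort(x)
  n <- length(x)
  2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n
}

gini_by_hand(c(1, 1, 1, 1))   # perfect equality: 0
gini_by_hand(c(0, 0, 0, 1))   # one unit owns everything: (n - 1) / n = 0.75

set.seed(1)
x <- rlnorm(1000)
gini_by_hand(x)
```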

Concentration index

> library(ineq)
> Herfindahl(x)
[1] 0.003091162
>  Rosenbluth(x)
[1] 0.002141646
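The Herfindahl index is the sum of the squared shares. A small sketch with three hypothetical market shares shows the computation by hand:

```r
shares <- c(50, 30, 20) / 100        # market shares of three hypothetical firms
hhi <- sum(shares^2)                 # 0.25 + 0.09 + 0.04 = 0.38
equal_hhi <- sum(rep(1 / 3, 3)^2)    # equal shares give the minimum, 1/n
c(hhi, equal_hhi)
```

The index ranges from 1/n (equal shares) to 1 (a single unit holds everything).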

Poverty index

> library(ineq)
> Sen(x,median(x)/2)
[1] 0.1342289
> ?pov # learn more about poverty index

Plotting the distribution

We can plot the distribution using a box plot (boxplot()), a histogram (hist()), a kernel estimator (plot() with density()) or the empirical cumulative distribution function (plot() with ecdf()). See the Nonparametric section to learn more about histograms and kernel density estimators. qqnorm() produces a normal QQ plot and qqline() adds a line to the QQ plot which passes through the first and the third quartiles.

A box plot is a graphical representation of the minimum, the first quartile, the median, the third quartile and the maximum.
stripchart() and stem() are also available.

> x <- rnorm(10^3)
> hist(x)
> plot(density(x))
> boxplot(x)
> plot(ecdf(x)) # plots the empirical distribution function
 
> qqnorm(x)
> qqline(x, col="red") # it does not do the plot but adds a line to existing one

Goodness of fit tests

Kolmogorov Smirnov Test :

The KS test is a one-sample goodness-of-fit test. The test statistic is simply the maximum of the absolute value of the difference between the empirical cumulative distribution function and the theoretical cumulative distribution function. KSd() (sfsmisc) gives the critical values for the KS statistic. As an example, we draw a sample from a Beta(2,2) distribution and we test whether it fits a Beta(2,2), a Beta(1,1) and a uniform distribution.

> y <- rbeta(1000,2,2) # Draw y in a Beta(2,2) distribution
> ks.test(y,"pbeta",2,2) # Test if it fits a beta(2,2) distribution
> ks.test(y,"pbeta",1,1) # Test if it fits a beta(1,1) distribution
> ks.test(y,"punif") # Test if it fits a uniform distribution (in fact the beta(1,1) is a uniform distribution)
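The definition of the KS statistic can be checked by hand. The sketch below computes the largest gap between the empirical and the theoretical distribution functions for a simulated Beta(2,2) sample and compares it with the value returned by ks.test().

```r
set.seed(1)
y <- rbeta(1000, 2, 2)
n <- length(y)

Fy <- pbeta(sort(y), 2, 2)   # theoretical CDF at the ordered sample values
# The empirical CDF jumps from (i - 1)/n to i/n at the i-th ordered value,
# so the largest gap is attained just before or just after a jump
D <- max(pmax(seq_len(n) / n - Fy, Fy - (seq_len(n) - 1) / n))

D
ks.test(y, "pbeta", 2, 2)$statistic   # same value
```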

Some tests are specific to the normal distribution. The Lilliefors test is an extension of the KS test for the case where the parameters are unknown. It is implemented by lillie.test() in the nortest package. shapiro.test() implements the Shapiro-Wilk normality test.

> N <- 100
> x <- rnorm(N)
> library("nortest")
> lillie.test(x)

         Lilliefors (Kolmogorov-Smirnov) normality test

data:  x 
D = 0.0955, p-value = 0.9982*
> shapiro.test(x)

	Shapiro-Wilk normality test

data:  x 
W = 0.9916, p-value = 0.7902

Anderson-Darling test :

> library("nortest")
> ad.test(x)

	Anderson-Darling normality test

data:  x 
A = 0.2541, p-value = 0.7247

See also the package ADGofTest for another version of this test^[1].

Shapiro-Francia normality test :

> sf.test(x)

	Shapiro-Francia normality test

data:  x 
W = 0.9866, p-value = 0.9953

Pearson chi-square normality test :

> library("nortest")
> pearson.test(x)

	Pearson chi-square normality test

data:  x 
P = 0.8, p-value = 0.8495

Cramer-von Mises normality test

> cvm.test(x)

	Cramer-von Mises normality test

data:  x 
W = 0.0182, p-value = 0.9756

Jarque-Bera test :

> library("tseries")
> jarque.bera.test(x)

	Jarque Bera Test

data:  x 
X-squared = 0.6245, df = 2, p-value = 0.7318

Discrete variable

We generate a discrete variable using sample() and we tabulate it using table(). We can plot using a pie chart (pie()), a bar chart (barplot() or barchart() (lattice)) or a dot chart (dotchart() or dotplot() (lattice)).

freq() (descr) prints the frequency, the percentages and produces a barplot. It supports weights.

> x <- sample(c("A","B","C"),100,replace=T)
> tab <- table(x)
> tab
> prop.table(tab)
> pie(tab)
> barplot(tab)
> dotchart(tab)
> library("descr")
> freq(x) 
x 
      Frequency Percent
A            32      32
B            34      34
C            34      34
Total       100     100

Multivariate analysis

Continuous variables

Covariance : cov()
Pearson's linear correlation : cor().
Pearson's correlation test : cor.test().
Spearman's rank correlation :
- cor() with method = "spearman".
- spearman() (Hmisc)
Spearman's rank correlation test :
- spearman2() (Hmisc)
- spearman.test() (Hmisc)
- spearman.test() (pspearman package) performs the Spearman’s rank correlation test with precomputed exact null distribution for n <= 22.
Kendall's correlation : cor() with method = "kendall". See also the Kendall package.

> N <- 100
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1 + 1
> y <- 1 + x1 + x2 + rnorm(N)
> plot(y ~ x1 ) # Scatter plot 
> mydat <- data.frame(y,x1,x2)
> cor(mydat)
> cor(mydat, method = "spearman")
> cor(mydat, method = "kendall")
> cor.test(mydat$x1,mydat$x2, method = "pearson")
> cor.test(mydat$x1,mydat$x2, method = "spearman")
> cor.test(mydat$x1,mydat$x2, method = "kendall")
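Spearman's rank correlation is simply Pearson's correlation computed on the ranks, which can be verified with base R. The sketch also shows how to extract the estimate and the p-value from the object returned by cor.test(); the simulated variables are assumptions made for the example.

```r
set.seed(1)
N  <- 100
x1 <- rnorm(N)
x2 <- rnorm(N) + x1 + 1

rho_spearman <- cor(x1, x2, method = "spearman")
rho_ranks    <- cor(rank(x1), rank(x2))   # Pearson correlation of the ranks
c(rho_spearman, rho_ranks)

# cor.test() returns a list: the estimate and the p-value can be extracted
ct <- cor.test(x1, x2, method = "pearson")
ct$estimate
ct$p.value
```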

Discrete variables

table(), xtabs() and prop.table() for contingency tables. ftable() (stats package) for a flat (nested) table.
assocplot() and mosaicplot() for graphical display of contingency table.
CrossTable() (descr) is similar to SAS Proc Freq. It returns a contingency table with Chi square and Fisher independence tests.
my.table.NA() and my.table.margin() (cwhmisc)
chisq.detail() (TeachingDemos)
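The base functions from this list can be combined as in the following sketch, which uses two simulated categorical variables (the names sex and group are hypothetical):

```r
set.seed(1)
N <- 200
sex   <- sample(c("F", "M"), N, replace = TRUE)
group <- sample(c("A", "B", "C"), N, replace = TRUE)

tab <- table(sex, group)        # two-way contingency table
tab
prop.table(tab)                 # cell proportions (sum to 1)
prop.table(tab, margin = 1)     # row proportions (each row sums to 1)
addmargins(tab)                 # add row and column totals
xtabs(~ sex + group)            # same table through the formula interface
chisq.test(tab)                 # chi-square test of independence
mosaicplot(tab)                 # graphical display
```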

Discrete and Continuous variables

bystats() Statistics by Categories in the Hmisc package
summaryBy() (doBy)
Multiple box plots : plot() or boxplot()

> N <- 100
> x <- sample(1:4,N, replace = T) 
> y <- x + rnorm(N)
> plot(y ~ x) # scatter plot
> plot(y ~ as.factor(x)) # multiple box plot
> boxplot(y ~ x) # multiple box plot
> bystats(y , as.factor(x), fun = mean) 
> bystats(y , as.factor(x), fun = quantile)

Equality of two sample means : t.test() and wilcox.test(). Equality of variances : var.test(). Equality of two distributions : ks.test().

N <- 100
x <- sample(0:1,N, replace = T) 
y <- x + rnorm(N)
t.test(y ~ x )
wilcox.test(y ~ x)
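The variance and distribution tests can be sketched on the same kind of simulated data. The tests return objects of class htest, from which the p-value can be extracted.

```r
set.seed(1)
N <- 100
x <- sample(0:1, N, replace = TRUE)
y <- x + rnorm(N)

var.test(y ~ x)                   # F test of equal variances across the groups
ks.test(y[x == 0], y[x == 1])     # two-sample Kolmogorov-Smirnov test

# The p-value of a test can be extracted from the returned object
tt <- t.test(y ~ x)
tt$p.value
```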

References

↑ Carlos J. Gil Bellosta (2009). ADGofTest: Anderson-Darling GoF test. R package version 0.1. http://CRAN.R-project.org/package=ADGofTest

Mathematics

Basics

?Arithmetic
?Special

Linear Algebra

Vectors

The inner product

The inner product is also called the dot product or the scalar product. It is the sum of the item-by-item product.

> u <- rep(3,3)
> v <- 1:3
> u%*%v # the inner product
     [,1]
[1,]   18

The outer product

The outer product of two vectors is the matrix whose element (i,j) is the product of the i-th element of the first vector and the j-th element of the second. It should not be confused with the cross product (also called the vector product).

> v <- rep(3,3)
> u <- 1:3
> u%o%v # The outer product
     [,1] [,2] [,3]
[1,]    3    3    3
[2,]    6    6    6
[3,]    9    9    9
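Both products can also be written with other base functions, which gives a quick way to check the definitions. A minimal sketch:

```r
u <- rep(3, 3)
v <- 1:3

sum(u * v)          # inner product as a plain number: 18
crossprod(u, v)     # t(u) %*% v, returned as a 1 x 1 matrix

outer(u, v)         # outer product; same as u %o% v
u %*% t(v)          # same matrix written as a column times a row
```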

Matrix Algebra

If you want to create a new matrix, one way is to use the matrix() function. You have to enter a vector of data, the number of rows and/or columns and finally you can specify if you want R to read your vector by row or by column (the default option) with byrow. You can also combine vectors using cbind() or rbind(). The dimension of a matrix can be obtained using the dim() function or alternatively nrow() and ncol().

> matrix(data = NA, nrow = 5, ncol = 5, byrow = T)
> X <- matrix(data = 1:15, nrow = 5, ncol = 3, byrow = T) # 15 values fill a 5 x 3 matrix
> X
> v1 <- 1:5
> v2 <- 5:1
> cbind(v1,v2)
> rbind(v1,v2)
> dim(X)
> nrow(X)
> ncol(X)

Some special matrices

The identity matrix has ones on the diagonal and zeros outside the diagonal.

eye() (matlab)
diag(1,nrow=10,ncol=10)
diag(rep(1,10))

J matrix is full of ones

ones() (matlab)

A matrix full of zeros

zeros() (matlab)

> library(matlab)
> eye(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
> ones(3)
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
[3,]    1    1    1
> zeros(3) 
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
[3,]    0    0    0
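The same matrices can be built without the matlab package. A minimal sketch with base R only (the object names are illustrative):

```r
I3 <- diag(3)                         # 3 x 3 identity matrix
J3 <- matrix(1, nrow = 3, ncol = 3)   # 3 x 3 matrix of ones
Z3 <- matrix(0, nrow = 3, ncol = 3)   # 3 x 3 matrix of zeros
D  <- diag(c(2, 5, 7))                # diagonal matrix with a given diagonal
D
```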

Diagonal matrix

> diag(3)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

Upper triangular

> round(upper.tri(matrix(1, n, n))) 

for n=3
     [,1] [,2] [,3]
[1,]    0    1    1
[2,]    0    0    1
[3,]    0    0    0

If you also need the diagonal of ones

> round(upper.tri(matrix(1, 3, 3), diag = TRUE))

      [,1] [,2] [,3]
[1,]    1    1    1
[2,]    0    1    1
[3,]    0    0    1

Lower triangular

Same as upper triangular but using lower.tri instead

Create a Hilbert matrix using hilbert() (fUtilities).

Matrix calculations

compute a matrix multiplication X%*%Y.

> b <- matrix(nrow = 2, ncol = 2, c(1, 2, 3, 4))
> a <- matrix(nrow = 2, ncol = 2, c(1, 0, 0, -1))
> a
     [,1] [,2]
[1,]    1    0
[2,]    0   -1
> b
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> a%*%b
     [,1] [,2]
[1,]    1    3
[2,]   -2   -4
> b%*%a
     [,1] [,2]
[1,]    1   -3
[2,]    2   -4

compute the Kronecker product using %x% or kron() (fUtilities).

> M <- matrix(rep(2,4),nrow = 2) 
> M
     [,1] [,2]
[1,]    2    2
[2,]    2    2
> I <- eye(2) 
> I
     [,1] [,2]
[1,]    1    0
[2,]    0    1
> I %x% M 
     [,1] [,2] [,3] [,4]
[1,]    2    2    0    0
[2,]    2    2    0    0
[3,]    0    0    2    2
[4,]    0    0    2    2
> library(fUtilities)
> kron(I,M)
     [,1] [,2] [,3] [,4]
[1,]    2    2    0    0
[2,]    2    2    0    0
[3,]    0    0    2    2
[4,]    0    0    2    2

Matrix transposition

Transpose the matrix

> M <- cbind(c(1,0,1),c(0,1,2),c(0,0,1))
> t(M)
     [,1] [,2] [,3]
[1,]    1    0    1
[2,]    0    1    2
[3,]    0    0    1

The trace and determinant of a matrix

compute the trace of a matrix using tr() (fUtilities)
compute the determinant of a matrix using det()
return the rank of a matrix using rk() (fBasics)
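The trace, the determinant and the rank can also be obtained with base R only. A minimal sketch on a small triangular matrix:

```r
M <- cbind(c(1, 0, 1), c(0, 1, 2), c(0, 0, 1))

sum(diag(M))    # trace: sum of the diagonal elements, here 3
det(M)          # determinant, here 1
qr(M)$rank      # rank, computed from the QR decomposition, here 3
```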

Matrix inversion

Invert a matrix using solve() or inv() (fUtilities). We can also compute the generalized inverse using ginv() in the MASS package.

> M <- cbind(c(1,0,1),c(0,1,2),c(0,0,1))
> solve(M)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]   -1   -2    1
> solve(M)%*%M
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

Solving a linear equation

> m=matrix(nrow=2,ncol=2,c(1,-.8,1,.2))
> m
     [,1] [,2]
[1,]  1.0  1.0
[2,] -0.8  0.2
> 
> l=matrix(c(1.0+25.0/18,25.0/18.0))
> l
         [,1]
[1,] 2.388889
[2,] 1.388889
> 
> k=solve(m,l)
> k
           [,1]
[1,] -0.9111111
[2,]  3.3000000
> 
> m%*%k          #checking the answer
         [,1]
[1,] 2.388889
[2,] 1.388889
>

Eigenvalue, eigenvector and eigenspace

Eigenvalues and eigenvectors

> eigen(M)
$values
[1] 1 1 1

$vectors
     [,1]          [,2]          [,3]
[1,]    0  2.220446e-16  0.000000e+00
[2,]    0  0.000000e+00  1.110223e-16
[3,]    1 -1.000000e+00 -1.000000e+00
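eigen() returns a list with the eigenvalues in $values and the eigenvectors in the columns of $vectors. The following sketch, on a small symmetric matrix chosen for illustration, checks the defining relation A v = lambda v:

```r
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)
e <- eigen(A)
e$values                      # 3 and 1

# Check the defining relation A v = lambda v for the first eigenpair
v1 <- e$vectors[, 1]
A %*% v1
e$values[1] * v1

# A symmetric matrix equals V diag(lambda) V'
V <- e$vectors
V %*% diag(e$values) %*% t(V)
```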

Misc

compute the norm of a matrix using norm() (fUtilities).
check if a matrix is positive definite isPositiveDefinite() (fUtilities).
make a matrix positive definite makePositiveDefinite() (fUtilities).

computes row statistics and column statistics (fUtilities).
extract the upper and the lower part of a matrix triang() and Triang() (fUtilities).
See also the matrix, matlab, matrixcalc, matrixStats packages.

Analysis

Logarithm and Exponents

We have the power function 10^3 or 10**3 , the logarithm and the exponential log(2.71), log10(10),exp(1).

> 10^3 # exponent
[1] 1000
> 10**3 # exponent
[1] 1000
> exp(1) # exponential
[1] 2.718282
> log(2.71) # natural logarithm
[1] 0.9969486
> log10(1000) # base 10 logarithm
[1] 3
> log(1000,base = 10) # base 10 logarithm
[1] 3

Polynomial equations

To solve $ax^{k}+bx^{k-1}+\cdots +n=0$ , where $a,b,\dots ,n$ are given numbers, pass the coefficients to polyroot() in increasing order of degree:

> polyroot(c(n,...,b,a))

So, for example, to calculate the roots of the equation $2x^{2}-5x-3=0$ one would do as follows:

> polyroot(c(-3,-5,2))
 [1] -0.5+0i  3.0-0i

The roots are $x_{1}=-0.5$ and $x_{2}=3$ . polyroot() always returns complex numbers, so real roots appear with a zero imaginary part.

See also the polynom and multipol packages.
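
To check the result, the polynomial can be evaluated at the roots returned by polyroot(). The following base R sketch does this for the same equation; the helper name p and the tolerance 1e-8 are illustrative choices.

```r
coefs <- c(-3, -5, 2)          # -3 - 5x + 2x^2, increasing order of degree
roots <- polyroot(coefs)

# Evaluate the polynomial at each root; the values should be numerically zero
p <- function(x) coefs[1] + coefs[2] * x + coefs[3] * x^2
residuals_at_roots <- Mod(p(roots))

# Keep the real part when the imaginary part is negligible
real_roots <- sort(Re(roots[abs(Im(roots)) < 1e-8]))
```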

Derivatives

Symbolic calculations

R can compute the symbolic derivative of an expression with D(). The function to differentiate must be converted into an expression using the expression() function; passing a bare formula such as x^2 gives an error message.

Here are some examples:

> D(expression(x^n),"x")
x^(n - 1) * n
> D(expression(exp(a*x)),"x")
exp(a * x) * a
> D(expression(1/x),"x")
-(1/x^2)
> D(expression(x^3),"x")
3 * x^2
> D(expression(pnorm(x)),"x")
dnorm(x)
> D(expression(dnorm(x)),"x")
-(x * dnorm(x))
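
The object returned by D() is an unevaluated call, so it can be evaluated numerically with eval(). A minimal sketch, with illustrative variable names:

```r
f_expr  <- expression(x^3)
df_expr <- D(f_expr, "x")          # symbolic derivative: 3 * x^2

# Evaluate the derivative at x = 2 by supplying a value for x
slope_at_2 <- eval(df_expr, list(x = 2))

# Wrap the derivative in an ordinary R function
df_fun <- function(x) eval(df_expr, list(x = x))
slopes <- df_fun(c(0, 1, 2))
```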

Numerical approximation

numDeriv package
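
When a package is not needed, a derivative can be approximated with a finite difference. The sketch below uses a central difference in base R; it is not the algorithm of the numDeriv package, and the function name num_deriv and the step h = 1e-5 are assumptions made for illustration.

```r
# Central finite difference: (f(x + h) - f(x - h)) / (2 * h)
num_deriv <- function(f, x, h = 1e-5) {
  (f(x + h) - f(x - h)) / (2 * h)
}

approx_slope <- num_deriv(pnorm, 0.5)   # numerical derivative of pnorm at 0.5
exact_slope  <- dnorm(0.5)              # known analytical derivative
```

The two values should agree to several decimal places, since the derivative of pnorm() is dnorm().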

Integration

R can perform one-dimensional integration with integrate(). For example, we can integrate the density of the normal distribution between $-\infty$ and $+\infty$ :

> integrate(dnorm,-Inf,Inf)
1 with absolute error < 9.4e-05
> integrate(dnorm,-1.96,1.96)
0.9500042 with absolute error < 1.0e-11
> integrate(dnorm,-1.64,1.64)
0.8989948 with absolute error < 6.8e-14
# we can also store the result in an object
> ci90 <- integrate(dnorm,-1.64,1.64)
> ci90$value
[1] 0.8989948
> integrate(dnorm,-1.64,1.64)$value
[1] 0.8989948

See the adapt package for multivariate integration.

> library(adapt)
> ?adapt
> ir2pi <- 1/sqrt(2*pi)
> fred <- function(z) { ir2pi^length(z) * exp(-0.5 * sum(z * z))}
> 
> adapt(2, lo = c(-5,-5), up = c(5,5), functn = fred)
       value       relerr       minpts       lenwrk        ifail 
    1.039222 0.0007911264          231           73            0 
> adapt(2, lo = c(-5,-5), up = c(5,5), functn = fred, eps = 1e-4)
       value       relerr       minpts       lenwrk        ifail 
    1.000237 1.653498e-05          655          143            0 
> adapt(2, lo = c(-5,-5), up = c(5,5), functn = fred, eps = 1e-6)
      value      relerr      minpts      lenwrk       ifail 
   1.000039 3.22439e-07        1719         283           0

See also integrate.gh() in the ecoreg package.
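
For low-dimensional problems, an integral can also be approximated by nesting calls to integrate(). The sketch below is base R only and is not the algorithm used by adapt(); it integrates a bivariate standard normal density over the square from -5 to 5 in each direction. The helper names f2 and inner are illustrative.

```r
# Bivariate standard normal density (independent components)
f2 <- function(x, y) dnorm(x) * dnorm(y)

# Inner integral over y for each value of x; integrate() needs a
# vectorised integrand, hence sapply()
inner <- function(x) {
  sapply(x, function(xi) integrate(function(y) f2(xi, y), -5, 5)$value)
}

# Outer integral over x
total <- integrate(inner, -5, 5)$value
```

Nested quadrature becomes expensive quickly as the dimension grows, which is why dedicated packages exist.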

Probability

The number of combinations of k elements chosen among n, $C_{n}^{k}$ , is given by choose(n, k):

> choose(100, 5)
[1] 75287520

Union and intersection

> union(1:10, 5:7)
[1]  1  2  3  4  5  6  7  8  9 10
> intersect(1:10, 5:7)
[1] 5 6 7

Arithmetics

The factorial function

factorial() returns the factorial of a non-negative integer. It can also be computed by applying prod() (product) to the vector of integers between 1 and the number of interest.

> factorial(3)
[1] 6
> prod(1:3)
[1] 6

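Note that by convention $0!=1$ , and factorial(0) returns 1. prod(0) does not follow this convention, because it is simply the product of the single number 0. prod(1:0) is 0 as well, since 1:0 is the vector c(1, 0).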

> factorial(0)
[1] 1
> prod(0)
[1] 0

Factorials grow very quickly and overflow double-precision arithmetic for arguments above 170.

> factorial(170)
[1] 7.257416e+306
> factorial(171)
[1] Inf
Warning message:
In factorial(171) : value out of range in 'gammafn'

The modulo function and Euclidean division

%% is the modulo operator and %/% is integer division (i.e. Euclidean division).

> 5%%2
[1] 1
> 5%/%2
[1] 2

Note: with non-integer operands, %% and %/% are subject to floating-point representation error, and the results may be platform-dependent.

> .5%/%.1 # we get 4 instead of 5
[1] 4
> .5%%.1 # we get .1 instead of 0
[1] 0.1
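
One way to sidestep the representation error, when the operands are known to be multiples of a common unit, is to rescale them to integers first. A minimal sketch, assuming values expressed in tenths (the variable names are illustrative):

```r
# Rescale to integers (tenths) before dividing
num <- round(0.5 * 10)   # 5
den <- round(0.1 * 10)   # 1

quotient  <- num %/% den          # integer division on exact integers
remainder <- (num %% den) / 10    # remainder, converted back to original units
```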

Geometry

pi is the constant π.
cos(), sin() and tan() are the trigonometric functions; their arguments are in radians.

Symbolic calculus

rSymPy provides SymPy functions in R.

If you want to do more symbolic calculus, see Maxima^[1], SAGE^[2] and Mathematica^[3].

References

↑ Maxima is open source http://maxima.sourceforge.net/
↑ SAGE is an open source package which includes R and Maxima : http://www.sagemath.org/
↑ Mathematica is not open source http://www.wolfram.com/products/mathematica/index.html

Optimization

Wikipedia has related information at Optimization (mathematics)

See also the Wikibook Statistics/Numerical_Methods/Optimization.

optimize() is devoted to one-dimensional optimization problems.
optim(), nlm() and ucminf() (ucminf) can be used for multidimensional optimization problems.
nlminb() can be used for optimization with box constraints.
The quadprog, minqa, rgenoud and trust packages provide further optimizers.
Work is under way to improve optimization in R. See Updating and improving optim(), Use R 2009 slides^[1], the R-forge optimizer page^[2] and the corresponding packages, including optimx.

Numerical Methods

One dimensional problem

We minimize a function of a single variable with optimize():

> func <- function(x){
+ 	return ( (x-2)^2 )
+ 	}
> (func(-2))
[1] 16
>
> # plot your function using the 'curve function'
> curve(func,-4,8) 
>
> # Here is another way to plot the function
> # using a grid
> grid <- seq(-10,10,by=.1) 
> func(grid)
> plot(grid,func(grid))
> 
> # you can find the minimum using the optimize function
> optimize(f=func,interval=c(-10,10))
$minimum
[1] 2

$objective
[1] 0

Newton-Raphson

nlm() provides a Newton-type algorithm.
The maxLik package maximizes likelihood functions. It includes the Newton-Raphson method.
newtonraphson() in the spuRs package.
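
As a minimal sketch, nlm() can be applied to the two-dimensional quadratic function used in the other examples of this chapter. The starting point c(0, 0) is an arbitrary choice.

```r
# Objective with its minimum at (2, 1)
func <- function(x) (x[1] - 2)^2 + (x[2] - 1)^2

# nlm() minimises func starting from p
out <- nlm(func, p = c(0, 0))
estimate <- out$estimate   # location of the minimum
minimum  <- out$minimum    # value of func at the minimum
```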

BFGS

The BFGS method is a quasi-Newton method available in optim(). The "L-BFGS-B" variant uses limited memory and accepts box constraints through lower and upper.

> func <- function(x){
+ 	out <- (x[1]-2)^2 + (x[2]-1)^2
+ 	return(out)
+ 	}
> 
> optim(par=c(0,0), fn=func, gr = NULL,
+       method = "BFGS",
+       lower = -Inf, upper = Inf,
+       control = list(), hessian = TRUE)
> optim(par=c(0,0), fn=func, gr = NULL,
+       method = "L-BFGS-B",
+       lower = -Inf, upper = Inf,
+       control = list(), hessian = TRUE)

Conjugate gradient method

Wikipedia has related information at Conjugate gradient method

optim() with method = "CG" (the method name is case-sensitive).
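
A minimal sketch of a conjugate gradient call, using the same quadratic function as in the BFGS example. Supplying the analytical gradient through gr is optional; the helper name grad is illustrative.

```r
func <- function(x) (x[1] - 2)^2 + (x[2] - 1)^2
grad <- function(x) c(2 * (x[1] - 2), 2 * (x[2] - 1))   # analytical gradient

# Conjugate gradient minimisation starting from (0, 0)
out <- optim(par = c(0, 0), fn = func, gr = grad, method = "CG")
```

out$par holds the estimated minimizer and out$convergence is 0 when the algorithm reports successful convergence.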

Trust Region Method

The trust package implements a trust region method.

The Nelder-Mead simplex method

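The Nelder-Mead method is a derivative-free simplex search. It is the default method of optim().

> func <- function(x){
+ 	out <- (x[1]-2)^2 + (x[2]-1)^2
+ 	return(out)
+ 	}
> 
> optim(par=c(0,0), fn=func, gr = NULL,
+       method = "Nelder-Mead",
+       lower = -Inf, upper = Inf,
+       control = list(), hessian = TRUE)

simplex() in the boot package implements the simplex method for linear programming, which is a different algorithm from the Nelder-Mead simplex search.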

Simulation methods

Simulated Annealing

Wikipedia has related information at Simulated annealing

Simulated annealing is a stochastic algorithm which is useful for optimizing non-smooth functions. It is implemented in optim() as method "SANN".

> func <- function(x){
+ 	out <- (x[1]-2)^2 + (x[2]-1)^2
+ 	return(out)
+ 	}
> 
> optim(par=c(0,0), fn=func, gr = NULL,
+       method = "SANN",
+       lower = -Inf, upper = Inf,
+       control = list(), hessian = TRUE)

EM Algorithm

Wikipedia has related information at EM algorithm

Genetic Algorithm

Wikipedia has related information at Genetic algorithm

rgenoud package for genetic algorithm^[3]
gaoptim package for genetic algorithm^[4]
GA is a general-purpose package for optimization using genetic algorithms. It provides a flexible set of tools for implementing genetic algorithm searches in both the continuous and discrete cases, whether constrained or not.^[5]

References

Citations

↑ Updating and improving optim(), Use R 2009 slides http://www.agrocampus-ouest.fr/math/useR-2009/slides/Nash+Varadhan.pdf
↑ R-forge optimizer http://optimizer.r-forge.r-project.org/
↑ Jasjeet Sekhon homepage : http://sekhon.berkeley.edu/rgenoud/
↑ gaoptim on CRAN: http://cran.r-project.org/web/packages/gaoptim/index.html
↑ ga on CRAN: http://cran.r-project.org/web/packages/GA/index.html/

Sources

Venables and Ripley, Chapter 16.
Cameron and Trivedi, Microeconometrics, Chapter 10.
Braun and Murdoch, first course in statistical programming with R (a very good reference on optimization using R), Chapter 7.

Probability Distributions

This page reviews the main probability distributions and describes the main R functions to deal with them.

R provides a family of functions for each distribution, distinguished by a one-letter prefix.

r is the generic prefix for random number generators, such as runif() and rnorm().
d is the generic prefix for probability density (or mass) functions, such as dunif() and dnorm().
p is the generic prefix for cumulative distribution functions, such as punif() and pnorm().
q is the generic prefix for quantile functions, such as qunif() and qnorm().
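
The four prefixes can be illustrated with the normal distribution. A minimal sketch; the seed and the variable names are arbitrary.

```r
dens  <- dnorm(0)        # density at 0
prob  <- pnorm(1.96)     # P(X <= 1.96)
quant <- qnorm(0.975)    # quantile: inverse of pnorm()
set.seed(1)              # arbitrary seed, for reproducibility
draws <- rnorm(5)        # five random draws

# qnorm() and pnorm() are inverse functions of each other
round_trip <- pnorm(qnorm(0.975))
```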

Discrete distributions

Benford Distribution

The Benford distribution is the distribution of the first digit of a number. It is due to Benford 1938^[1] and Newcomb 1881^[2].

> library(VGAM)
> dbenf(c(1:9))
[1] 0.30103000 0.17609126 0.12493874 0.09691001 0.07918125 0.06694679 0.05799195 0.05115252 0.04575749
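
Benford's law gives the probability of the leading digit d as log10(1 + 1/d), so the same probabilities can be computed without VGAM. A minimal base R sketch, with illustrative variable names:

```r
digits  <- 1:9
benford <- log10(1 + 1 / digits)   # P(first digit = d)
names(benford) <- digits
```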

Bernoulli

We can draw from a Bernoulli distribution using sample(), runif() or rbinom() with size = 1.

> n <- 1000
> x <- sample(c(0,1), n, replace=TRUE) # P(x = 1) = 0.5
> x <- sample(c(0,1), n, replace=TRUE, prob=c(0.3,0.7)) # P(x = 1) = 0.7
> x <- runif(n) > 0.3 # logical vector, TRUE with probability 0.7
> x <- rbinom(n, size=1, prob=0.2) # P(x = 1) = 0.2

Binomial

We can sample from a binomial distribution using the rbinom() function with arguments n for number of samples to take, size defining the number of trials and prob defining the probability of success in each trial.

> x <- rbinom(n=100,size=10,prob=0.5)

Hypergeometric distribution

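We can sample from a hypergeometric distribution using the rhyper() function. Its arguments are nn (number of draws), m (number of white balls in the urn), n (number of black balls in the urn) and k (number of balls drawn from the urn). Because n is not the sample size here, it is safest to name all the arguments.

> x <- rhyper(nn=1000, m=15, n=5, k=5)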

Geometric distribution

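The geometric distribution is the distribution of the number of failures before the first success in a series of Bernoulli trials. rgeom() takes the number of draws and the probability of success.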

> N <- 10000
> x <- rgeom(N, .5)
> x <- rgeom(N, .01)

Multinomial

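Draws of a categorical variable (a multinomial distribution with a single trial) can be simulated with sample(). rmultinom() returns the counts per category directly.

> sample(1:6, 100, replace=TRUE, prob= rep(1/6,6))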
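
The two approaches can be compared as follows. This is a minimal sketch; the seed and the variable names are arbitrary.

```r
set.seed(1)   # arbitrary seed
probs <- rep(1 / 6, 6)

# 100 rolls of a fair die, one outcome per roll, then tabulated
rolls <- sample(1:6, 100, replace = TRUE, prob = probs)
counts_from_sample <- table(factor(rolls, levels = 1:6))

# The same experiment drawn directly as a vector of counts
counts <- rmultinom(n = 1, size = 100, prob = probs)[, 1]
```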

Negative binomial distribution

The negative binomial distribution is the distribution of the number of failures before k successes in a series of Bernoulli events.

> N <- 100000
> x <- rnbinom(N, 10, .25)

Poisson distribution

We can draw n values from a Poisson distribution with a mean set by the argument lambda.

> x <- rpois(n=100, lambda=3)

Zipf's law

The distribution of the frequency of words is known as Zipf's Law. It is also a good description of the distribution of city size^[3]. The density and distribution functions are dzipf() and pzipf() (VGAM).

> library(VGAM)
> dzipf(x=2, N=1000, s=2)

Continuous distributions

Beta and Dirichlet distributions

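The Beta distribution is available in base R through dbeta(), pbeta(), qbeta() and rbeta().
The Dirichlet distribution is available in the gtools, bayesm and MCMCpack packages.

> library(gtools)
> ?rdirichlet
> library(bayesm)
> ?rdirichlet
> library(MCMCpack)
> ?Dirichlet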

Cauchy

We can sample n values from a Cauchy distribution with a given location parameter $x_{0}$ (default is 0) and scale parameter $\gamma$ (default is 1) using the rcauchy() function.

> x <- rcauchy(n=100, location=0, scale=1)

Chi Square distribution

Quantile of the Chi-square distribution ( $\chi ^{2}$ distribution)

> qchisq(.95,1)
[1] 3.841459
> qchisq(.95,10)
[1] 18.30704
> qchisq(.95,100)
[1] 124.3421

Exponential

We can sample n values from an exponential distribution with a given rate (default is 1) using the rexp() function:

> x <- rexp(n=100, rate=1)

Fisher-Snedecor

We can plot the density of a Fisher-Snedecor distribution (F-distribution) for several degrees of freedom:

> par(mar=c(3,3,1,1))
> x <- seq(0,5,len=1000)
> plot(range(x),c(0,2),type="n")
> grid()
> lines(x,df(x,df1=1,df2=1),col="black",lwd=3)
> lines(x,df(x,df1=2,df2=1),col="blue",lwd=3)
> lines(x,df(x,df1=5,df2=2),col="green",lwd=3)
> lines(x,df(x,df1=100,df2=1),col="red",lwd=3)
> lines(x,df(x,df1=100,df2=100),col="grey",lwd=3)
> legend(2,1.5,legend=c("n1=1, n2=1","n1=2, n2=1","n1=5, n2=2","n1=100, n2=1","n1=100, n2=100"),col=c("black","blue","green","red","grey"),lwd=3,bty="n")

Gamma

We can sample n values from a gamma distribution with a given shape parameter and scale parameter $\theta$ using the rgamma() function. Alternatively, a shape parameter and a rate parameter $\beta =1/\theta$ can be given. Specify either scale or rate, not both; shape is always required.

> x <- rgamma(n=10, shape=0.4, scale=1)
> x <- rgamma(n=100, shape=0.4, rate=0.8)

Levy

We can sample n values from a Levy distribution with a given location parameter $\mu$ (defined by the argument m, default is 0) and scaling parameter (given by the argument s, default is 1) using the rlevy() function. rlevy() is not part of base R; it is provided, for example, by the rmutil package.

> x <- rlevy(n=100, m=0, s=1)

Log-normal distribution

We can sample n values from a log-normal distribution with a given meanlog (default is 0) and sdlog (default is 1) using the rlnorm() function:

> x <- rlnorm(n=100, meanlog=0, sdlog=1)

Normal and related distributions

We can sample n values from a normal (Gaussian) distribution with a given mean (default is 0) and sd (default is 1) using the rnorm() function:

> x <- rnorm(n=100, mean=0, sd=1)

Quantile of the normal distribution

> qnorm(.95)
[1] 1.644854
> qnorm(.975)
[1] 1.959964
> qnorm(.99)
[1] 2.326348

The mvtnorm package includes functions for multivariate normal distributions.
- rmvnorm() draws from a multivariate normal distribution.

> library(mvtnorm)
> sig <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
> r <- rmvnorm(1000, sigma = sig)
> cor(r) 
          [,1]      [,2]
[1,] 1.0000000 0.8172368
[2,] 0.8172368 1.0000000

Pareto Distributions

Generalized Pareto dgpd() in evd
dpareto(), ppareto(), rpareto(), qpareto() in actuar
The VGAM package also has functions for the Pareto distribution.

Student's t distribution

Quantile of the Student t distribution

> qt(.975,30)
[1] 2.042272
> qt(.975,100)
[1] 1.983972
> qt(.975,1000)
[1] 1.962339

The following lines plot the 0.975 quantile of the t distribution as a function of the degrees of freedom, with the corresponding normal quantile as a horizontal line:

curve(qt(.975,x), from = 2 , to = 100, ylab = "Quantile 0.975 ", xlab = "Degrees of freedom", main = "Student t distribution")
abline(h=qnorm(.975), col = 2)

Uniform distribution

We can sample n values from a uniform distribution (also known as a rectangular distribution) between two values (defaults are 0 and 1) using the runif() function:

> runif(n=100, min=0, max=1)

Weibull

We can sample n values from a Weibull distribution with a given shape and scale parameter $\mu$ (default is 1) using the rweibull() function.

> x <- rweibull(n=100, shape=0.5, scale=1)

Extreme values and related distribution

The Gumbel distribution
The logistic distribution is the distribution of the difference of two independent, identically distributed Gumbel variables.

plogis, qlogis, dlogis, rlogis

Frechet dfrechet() evd
Generalized Extreme Value dgev() evd
Gumbel dgumbel() evd
Burr, dburr, pburr, qburr, rburr in actuar

Distribution in circular statistics

Functions for circular statistics are included in the CircStats package.
- dvm() Von Mises (also known as the circular normal or Tikhonov distribution) density function
- dtri() triangular density function
- dmixedvm() Mixed Von Mises density
- dwrpcauchy() wrapped Cauchy density
- dwrpnorm() wrapped normal density.

References

↑ Benford, F. (1938) The Law of Anomalous Numbers. Proceedings of the American Philosophical Society, 78, 551–572.
↑ Newcomb, S. (1881) Note on the Frequency of Use of the Different Digits in Natural Numbers. American Journal of Mathematics, 4, 39–40.
↑ Gabaix, Xavier (August 1999). "Zipf's Law for Cities: An Explanation". Quarterly Journal of Economics 114 (3): 739–67. doi:10.1162/003355399556133. ISSN 0033-5533. http://pages.stern.nyu.edu/~xgabaix/papers/zipf.pdf.

Random Number Generation

Random Number Generators

Computers are deterministic machines and are therefore not a reliable source of large amounts of truly random values. In practice, pseudo-random number generators are used. The default algorithm in R is Mersenne-Twister, but several other methods are available. See the help of RNGkind() to learn about random number generators.

?RNGkind

It is possible to use true random numbers. Some are made available by random.org. The random package gives access to them.

Randu

Randu is an old linear congruential pseudorandom number generator. A dataset generated with Randu is included in the datasets package. The code used to generate the dataset is given in its help page.

library("datasets")
?randu
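
To show what a linear congruential generator looks like, here is a minimal sketch of the Randu recurrence. The multiplier 65539 and the modulus 2^31 are the usual Randu constants and are not stated in the text above; the seed and the function name randu_next are arbitrary. The sketch also checks the linear relation between three consecutive values that makes Randu a poor generator.

```r
randu_next <- function(x) (65539 * x) %% 2^31   # Randu recurrence

n <- 1000
x <- numeric(n)
x[1] <- 1                      # arbitrary odd seed
for (i in 2:n) x[i] <- randu_next(x[i - 1])
u <- x / 2^31                  # rescale to (0, 1)

# Known weakness: x[k+2] = 6 * x[k+1] - 9 * x[k]  (mod 2^31)
k <- 1:(n - 2)
lhs <- x[k + 2]
rhs <- (6 * x[k + 1] - 9 * x[k]) %% 2^31
```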

Seed

A pseudo-random number generator is a deterministic algorithm started from a value called the "seed". To replicate your results exactly, fix the seed with the function set.seed(). The argument of set.seed() has to be an integer.

> set.seed(1)
> runif(1)
[1] 0.2655087
> set.seed(1)
> runif(1)
[1] 0.2655087

Sampling in a vector

Toss 10 coins

> sample(0:1,10,replace=T)
 [1] 1 0 0 0 1 0 0 1 1 1

Roll 10 dice

> sample(1:6,10,replace=T)
 [1] 4 1 5 3 2 5 5 6 3 2

Play the lottery (6 random numbers out of 49 without replacement)

> sample(1:49,6,replace=F)
[1] 18 35 29  1 33 11

You can sample from a multinomial distribution:

> mydat <- sample(1:4,1000,replace=TRUE,prob=c(.2,.3,.2,.3))
> table(mydat)

Sampling in a standard univariate distribution

You can use rnorm(), rt(), etc.

Misspecified argument

Note that if the first argument of rnorm() is a vector instead of a number, R uses the length of the vector as the sample size instead of returning an error. Here is an example:

x <- rnorm(10) # Sample a normal random vector
set.seed(1) # use the seed
z <- rnorm(x) # put a vector instead of a number as an argument of rnorm
set.seed(1) # initialize the seed again
z2 <- rnorm(length(x)) # sample in a vector with the same length as x
plot(z2,z) # check that z and z2 are the same

Inverse CDF method

If you know the inverse CDF (quantile function), you can generate the random variable by sampling from the standard uniform distribution and transforming the draws with the inverse CDF.

For instance, if you want to simulate from a standard normal distribution, you can simulate from a standard uniform and transform it using the quantile function of the normal distribution.

N <- 100
qnorm(runif(N))

This draws from the same distribution as the rnorm() function, but in this example the computing time is slightly higher:

> N <- 10^7
> system.time(qnorm(runif(N)))
   user  system elapsed 
   1.67    0.00    1.70 
> system.time(rnorm(N)) 
   user  system elapsed 
   1.50    0.00    1.51
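
The method is easiest to see with a distribution whose inverse CDF has a closed form. The sketch below applies it to the exponential distribution and compares the result with qexp(); the seed, the rate of 2 and the variable names are illustrative choices.

```r
set.seed(1)            # arbitrary seed
N <- 10000
rate <- 2

# CDF: F(x) = 1 - exp(-rate * x)  =>  inverse: -log(1 - u) / rate
u <- runif(N)
x_manual <- -log(1 - u) / rate

# The built-in quantile function applies the same transformation
x_qexp <- qexp(u, rate = rate)
```

The sample mean of x_manual should be close to 1 / rate.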

Importance sampling

See Jeff Gill's routine

Metropolis algorithm

Gibbs algorithm

rgs package

Quasi random numbers

Sometimes you need to generate quasi-random sequences. The randtoolbox package provides several quasi-random number generators.

See also sHalton() and QUnif() (sfsmisc).

Halton Sequence

> library(randtoolbox)
> halton(10, dim = 2, init = TRUE, normal = FALSE, usetime = FALSE)
        [,1]       [,2]
 [1,] 0.5000 0.33333333
 [2,] 0.2500 0.66666667
 [3,] 0.7500 0.11111111
 [4,] 0.1250 0.44444444
 [5,] 0.6250 0.77777778
 [6,] 0.3750 0.22222222
 [7,] 0.8750 0.55555556
 [8,] 0.0625 0.88888889
 [9,] 0.5625 0.03703704
[10,] 0.3125 0.37037037

You can compare Halton draws with the standard R (pseudo) random number generator. Halton draws cover the unit square much more evenly.

> random <- cbind(runif(1000),runif(1000))
> hal <- halton(1000, dim = 2, init = TRUE, normal = FALSE, usetime = FALSE)
> par(mfrow=c(1,2))
> plot(hal[,1],hal[,2])
> plot(random[,1],random[,2])
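
The Halton sequence itself is simple to construct: the i-th point in a given base is the radical inverse of i, i.e. the digits of i in that base mirrored around the decimal point. The following base R sketch is not the randtoolbox implementation; the function names are illustrative, and bases 2 and 3 are the usual choice for two dimensions.

```r
# Radical inverse of the integer i in the given base
radical_inverse <- function(i, base) {
  result <- 0
  f <- 1 / base
  while (i > 0) {
    result <- result + f * (i %% base)   # add the next digit, scaled
    i <- i %/% base                      # drop that digit
    f <- f / base
  }
  result
}

# First n points of a 2-dimensional Halton sequence (bases 2 and 3)
halton_2d <- function(n) {
  cbind(sapply(1:n, radical_inverse, base = 2),
        sapply(1:n, radical_inverse, base = 3))
}

h <- halton_2d(10)
```

The rows of h can be compared with the output of halton(10, dim = 2) shown above.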

Examples

Resources

Revolution Computing entry on pseudo random numbers
Statistical Distributions Module on Wessa.net: an online application which generates random numbers using R. You can access the R code and use it in your own programs.

Maximum Likelihood

Introduction

Maximum likelihood estimation is an optimization problem: you write down the log likelihood function and maximize it with a numerical optimizer. Sometimes you also need to supply the score (the first derivative of the log likelihood) and/or the Hessian (the second derivative of the log likelihood).

One dimension

If there is only one parameter, we can optimize the log likelihood using optimize().

Example with a type 1 Pareto distribution

We provide an example with a type 1 Pareto distribution. In this example the minimum is not estimated by numerical optimization (the code simply plugs in the sample minimum), so only the shape parameter remains and the problem is one-dimensional.

We use the rpareto1() (actuar) function to generate a random vector from a type 1 Pareto distribution with shape equal to 1 and minimum value equal to 500. We use the dpareto1() (actuar) function with option log = TRUE to write the log likelihood. Then we just need to use optimize() with maximum=TRUE. We provide a minimum and a maximum value for the parameter with the interval option.

> library(actuar)
> y <- rpareto1(1000, shape = 1, min = 500)
> ll <- function(mu, x) { 
+    sum(dpareto1(x,mu[1],min = min(x),log = TRUE)) 
+   } 
> optimize(f = ll, x = y, interval = c(0,10), maximum = TRUE)
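
The same estimation can be sketched in base R, without the actuar package, by simulating with the inverse CDF method and writing the log likelihood by hand. For the type 1 Pareto distribution with known minimum, the maximum likelihood estimator of the shape also has a closed form, which gives a check on the numerical result. The seed and the variable names are illustrative, and the minimum xm is treated as known here.

```r
set.seed(1)                      # arbitrary seed
n <- 1000
xm <- 500                        # minimum, treated as known
alpha <- 1                       # true shape

# Inverse CDF simulation: F(x) = 1 - (xm / x)^alpha
# (1 - U and U have the same distribution, so runif(n) is used directly)
y <- xm / runif(n)^(1 / alpha)

# Log likelihood of the type 1 Pareto distribution in the shape parameter a
ll <- function(a, x) {
  length(x) * log(a) + length(x) * a * log(xm) - (a + 1) * sum(log(x))
}

fit <- optimize(f = ll, x = y, interval = c(0, 10), maximum = TRUE)
alpha_numeric <- fit$maximum

# Closed-form maximum likelihood estimator, for comparison
alpha_closed <- n / sum(log(y / xm))
```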

Multiple dimension

fitdistr() (MASS package) fits univariate distributions by maximum likelihood. It is a wrapper for optim().
If you need to program your own maximum likelihood estimator (MLE), you can use a built-in optimizer such as nlm() or optim(). R also includes the following tools:
mle() in the stats4 package
The maxLik package

Example with a logistic distribution

For instance, we draw from a logistic distribution and we estimate the parameters using fitdistr().

> library(MASS)
> # draw from a Gumbel distribution using the inverse CDF simulation method
> e.1 <- -log(-log(runif(10000,0,1))) 
> e.2 <- -log(-log(runif(10000,0,1)))
> u <- e.2 - e.1  # u follows a logistic distribution (difference between two gumbels.)
> fitdistr(u,densfun=dlogis,start=list(location=0,scale=1))

Example with a Cauchy distribution

For instance, we can write a simple maximum likelihood estimator for a Cauchy distribution using the nlm() optimizer. We first draw a vector x from a Cauchy distribution. Then we define the log likelihood function and optimize it using the nlm() function. Note that nlm() is a minimizer and not a maximizer, so we minimize the negative log likelihood.

> n <- 100
> x <- rcauchy(n)
> mlog.1 <- function(mu, x) { 
+   - sum(dcauchy(x, location = mu, log = TRUE)) 
+   } 
> mu.start <- median(x)
> out <- nlm(mlog.1, mu.start, x = x)
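
The object returned by nlm() contains the estimate, and with hessian = TRUE it also contains the Hessian of the negative log likelihood, from which an approximate standard error can be derived. A minimal sketch along the lines of the example above; the seed, the sample size and the variable names are illustrative.

```r
set.seed(1)                 # arbitrary seed
x <- rcauchy(200)

# Negative log likelihood in the location parameter
negll <- function(mu, x) -sum(dcauchy(x, location = mu, log = TRUE))

out <- nlm(negll, median(x), x = x, hessian = TRUE)

mu_hat <- out$estimate                    # maximum likelihood estimate
se_hat <- sqrt(1 / out$hessian[1, 1])     # inverse observed information
```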

Example with a beta distribution

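Here is another example with the Beta distribution and the optim() function. Since optim() minimizes by default, the function returns the negative log likelihood.

> y <- rbeta(1000,2,2)
> # negative log likelihood, since optim() minimizes by default
> mloglik <- function(mu, x) { 
+    -sum(dbeta(x,mu[1],mu[2],log = TRUE)) 
+    } 
> 
> # the shape parameters must be strictly positive, hence the small positive lower bounds
> out <- optim(par = c(1,1), fn=mloglik,x=y,method = "L-BFGS-B",lower=c(1e-6,1e-6))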

Tests

Likelihood Ratio Test

lrtest() in the lmtest package^[1] performs a likelihood ratio test between nested models.
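
The likelihood ratio statistic can also be computed by hand from logLik(). The sketch below compares two nested linear models on simulated data; the models, the seed and the variable names are illustrative, and this is not the code of lrtest().

```r
set.seed(1)                        # arbitrary seed
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 + rnorm(n)            # x2 has no effect in the simulated data

fit0 <- lm(y ~ x1)                 # restricted model
fit1 <- lm(y ~ x1 + x2)            # unrestricted model

# LR statistic: twice the difference in maximised log likelihoods
lr <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))

# Degrees of freedom: difference in the number of estimated parameters
df_lr <- attr(logLik(fit1), "df") - attr(logLik(fit0), "df")

# Asymptotic chi-square p-value
p_value <- pchisq(lr, df = df_lr, lower.tail = FALSE)
```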

Some Specific cases

gum.fit() (ismev package) provides MLE for a Gumbel distribution.

Resources

References

↑ Achim Zeileis, Torsten Hothorn (2002). Diagnostic Checking in Regression Relationships. R News 2(3), 7-10. URL http://CRAN.R-project.org/doc/Rnews/

Method of Moments

Wikipedia has related information at Generalized method of moments and Method of moments (statistics)

The gmm package implements the generalized method of moments and the generalized empirical likelihood.

First, it is possible to estimate a simple linear model or a simple linear model with instrumental variables using the gmm() function. The GMM method is often used to estimate heteroskedastic instrumental variable models.

> # Simple linear model
> library(gmm)
> N <- 1000
> u <- rnorm(N)
> x <- 1 + rnorm(N)
> y <- 1 + x + u
> res <- gmm(y ~ x, x)

> # Simple linear model with instrumental variables.
> N <- 1000
> u <- rnorm(N)
> z <- rnorm(N)
> x <- 1 + z + u + rnorm(N)
> y <- 1 + x + u
> res <- gmm(y ~ x, z)
> summary(res)

Bayesian Methods

Introduction

R has many Bayesian estimation procedures, many more than Stata or SAS.

LearnBayes by Jim Albert
bayesm by Peter Rossi and Rob McCulloch
BaM by Jeff Gill
arm package by Jennifer Hill and Andrew Gelman.
MCMCpack package.
mcsm package by Christian Robert and George Casella.
umacs by Jouni Kerman and Andrew Gelman.

Interface with WinBugs

WinBugs/OpenBugs is a popular statistical package for MCMC techniques.
Andrew Gelman has some instructions on using R and WinBugs on his webpage.
There is also an interface with JAGS.

Resources

Christian Robert and Jean Michel Marin, The Bayesian Core (the book's website includes slides and R scripts).
Jim Albert, Bayesian Computation with R, Use R!, Springer 2007.
Christian Robert and George Casella, Introducing Monte Carlo Methods with R, Use R!, Springer 2009 (the book's website includes the mcsm package).
Peter Rossi, Greg Allenby and Robert McCulloch, Bayesian Statistics and Marketing, and the bayesm package.
CRAN task view for Bayesian statistics.

Bootstrap

Wikipedia has related information at Bootstrapping (statistics)

The boot package includes functions from the book Bootstrap Methods and Their Application by A. C. Davison and D. V. Hinkley (1997, CUP).
bootstrap package.

Quick how-to

Do a bootstrap of some data for some function (here, the mean). The statistic passed to boot() takes the data and a vector of resampled indices:

 library(boot)
 b <- boot(data, function(data, id) { mean(data[id]) }, 1000)

Use this to compute a 90% confidence interval:

 boot.ci(b, .9, type="norm")
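
The idea behind the bootstrap can also be written out in a few lines of base R. The sketch below resamples a simulated vector with replacement and computes a 90% percentile interval, which is a different interval type from the "norm" interval requested above. The data, the seed and the variable names are illustrative.

```r
set.seed(1)                       # arbitrary seed
data <- rnorm(50, mean = 10)      # illustrative sample
R <- 1000                         # number of bootstrap replicates

# Resample with replacement and recompute the mean each time
boot_means <- replicate(R, mean(sample(data, replace = TRUE)))

# 90% percentile interval
ci90 <- quantile(boot_means, probs = c(0.05, 0.95))
```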

References

Instructions for the boot package: http://www.statmethods.net/advstats/bootstrapping.html
Sample using the boot package: http://druedin.com/2012/11/10/bootstrapping-in-r/

Multiple Imputation

Multiple imputation of missing data generally involves two steps. First, an imputation step produces multiple complete datasets. Second, the results obtained by applying the chosen technique to each separate dataset are combined. The packages needed for these two steps are not necessarily the same, but can be.

mi package (see also Andrew Gelman's blog entry on the mi package).
mice package.
Amelia
mitools by Thomas Lumley
survey and Zelig have support for multiply imputed datasets.

References

Nonparametric Methods

This page deals with a set of non-parametric methods including the estimation of a cumulative distribution function (CDF), the estimation of probability density function (PDF) with histograms and kernel methods and the estimation of flexible regression models such as local regressions and generalized additive models.

For an introduction to nonparametric methods you can have a look at the following books and handouts:

Nonparametric Econometrics: A Primer by Jeffrey S. Racine^[1].
Li and Racine's handbook, Nonparametric econometrics^[2].
Larry Wasserman, All of Nonparametric Statistics^[3].

Empirical distribution function

The easiest way to estimate the empirical CDF uses the rank() and the length() functions.
ecdf() computes the empirical cumulative distribution function.
ecdf.ksCI() (sfsmisc) plots the empirical distribution function with confidence intervals.

> N <- 1000
> x <- rnorm(N)
> edf <- rank(x)/length(x)
> plot(x,edf)
> plot(ecdf(x),xlab = "x",ylab = "Distribution of x")
> grid()
> library("sfsmisc")
> ecdf.ksCI(x)
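
The rank-based estimate and ecdf() give the same values at the sample points when there are no ties. A minimal check, with an arbitrary seed and illustrative variable names:

```r
set.seed(1)                       # arbitrary seed
x <- rnorm(1000)

edf_rank <- rank(x) / length(x)   # rank-based estimate
Fn <- ecdf(x)                     # step function returned by ecdf()
edf_fn <- Fn(x)                   # evaluated at the sample points
```

ecdf() returns a function, so Fn can also be evaluated at points that are not in the sample, for example Fn(0).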

Density Estimation

Histogram

hist() is the standard function for drawing histograms. If you store the histogram as an object, the computed breaks, counts and densities are returned in this object.

> x <- rnorm(1000)
> hist(x, probability = T) # The default uses Sturges method.
> # Sturges, H. A. (1926) The choice of a class interval.
> # Journal of the American Statistical Association 21, 65–66. 
> hist(x, breaks = "Sturges", probability = T)
> 
> # Freedman, D. and Diaconis, P. (1981) On the histogram as a density estimator: L_2 theory.
> # Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57, 453–476. 
> # Number of bins: (n^(1/3) * range)/(2 * IQR).
> hist(x, breaks = "FD", probability = T)
> 
> # Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66, 605–610. 
> # Number of bins: ceiling[n^(1/3) * range/(3.5 * s)].
> hist(x, breaks = "scott", probability = T)
> 
> # Wand, M. P. (1997). Data-based choice of histogram bin width.
> # The American Statistician, 51, 59–64. 
> library("KernSmooth")
> h <- dpih(x)
> bins <- seq(min(x)-h, max(x)+h, by=h)
> hist(x, breaks=bins, probability = T)

It is also possible to choose the break points.

> x <- rnorm(1000)
> hist(x, breaks = seq(-4,4,.1))

n.bins() (car package) includes several methods to compute the number of bins for a histogram.
histogram() (lattice)
truehist() (MASS)
hist.scott() (MASS) plots a histogram with automatic bin width selection, using the Scott or Freedman–Diaconis formulae.
histogram package.
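
The number of bins chosen by each rule can be inspected directly with the nclass.Sturges(), nclass.FD() and nclass.scott() functions from grDevices. A minimal sketch with simulated data; the seed and the variable names are arbitrary.

```r
set.seed(1)            # arbitrary seed
x <- rnorm(1000)

n_sturges <- nclass.Sturges(x)   # ceiling(log2(n) + 1)
n_fd      <- nclass.FD(x)        # Freedman-Diaconis rule
n_scott   <- nclass.scott(x)     # Scott rule

# The breaks argument of hist() also accepts a number of bins,
# which is treated as a suggestion
h <- hist(x, breaks = n_fd, plot = FALSE)
```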

Kernel Density Estimation

density() estimates the kernel density of a vector.
- Choose the bandwidth selection method with bw.
- Check the sensitivity of the bandwidth choice using adjust. The default is one. It is good practice to look at adjust=.5 and adjust=2.

> x <- rnorm(10^3)
> plot(density(x,bw = "nrd0", adjust = 1, kernel = "gaussian"), col = 1)
> lines(density(x,bw = "nrd0", adjust = .5, kernel = "gaussian"), col = 2)
> lines(density(x,bw = "nrd0", adjust = 2, kernel = "gaussian"), col = 3)
> legend("topright", legend = c("adjust = 1", "adjust = .5", "adjust = 2"), col = 1:3, lty = 1)

- Choose the kernel function with kernel : "gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine".

> x <- rnorm(10^3)
> plot(density(x,bw = "nrd0", adjust = 1, kernel = "gaussian"), col = 1)
> lines(density(x,bw = "nrd0", adjust = 1, kernel = "epanechnikov"), col = 2)
> lines(density(x,bw = "nrd0", adjust = 1, kernel = "rectangular"), col = 3)
> lines(density(x,bw = "nrd0", adjust = 1, kernel = "triangular"), col = 4)
> legend("topright", legend = c("gaussian", "epanechnikov", "rectangular",  "triangular"), col = 1:4, lty = 1)
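
To see what density() estimates, a Gaussian kernel density estimate can be written by hand: at each point it is the average of normal densities centred on the observations, scaled by the bandwidth. The sketch below is a direct, slow implementation for illustration only; density() relies on a faster approximation, so the values will not match exactly. The function name kde and the grid are illustrative.

```r
set.seed(1)                     # arbitrary seed
x <- rnorm(500)
h <- bw.nrd0(x)                 # same rule of thumb as bw = "nrd0"

# Gaussian kernel density estimate at the points x0
kde <- function(x0, x, h) {
  sapply(x0, function(z) mean(dnorm((z - x) / h)) / h)
}

x_grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 512)
f_hat <- kde(x_grid, x, h)

# The estimate should integrate to approximately one (trapezoid rule)
area <- sum(diff(x_grid) * (head(f_hat, -1) + tail(f_hat, -1)) / 2)
```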

tkdensity() (sfsmisc) is a nice function which lets you choose the kernel and the bandwidth interactively through a graphical user interface. This is a good way to check the sensitivity of the density estimate to the bandwidth and/or kernel choice.

> x  <- rnorm(10^3)
> library("sfsmisc")
> tkdensity(x)

kde2d() (MASS) estimates a bivariate kernel density.

> library(MASS)
> N <- 1000
> x <- rnorm(N)
> y <- 1 + x^2 + rnorm(N)
> dd <-  kde2d(y,x) # estimate the bivariate kernel density
> contour(dd) # plot the bivariate density
> image(dd) # another plot of the bivariate density

Examples

Local Regression

loess() is the standard function for local polynomial regression.
lowess() is similar to loess() but does not use the standard formula syntax y ~ x. It is the ancestor of loess() (with different defaults!).
ksmooth() (stats) computes the Nadaraya–Watson kernel regression estimate.
locpoly() (KernSmooth package)
npreg() (np package)
The locpol package computes local polynomial estimators.
The locfit package provides local regression, likelihood and density estimation.
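
As a minimal sketch of the first two functions, the example below fits loess() and lowess() to simulated data. The data-generating process, the span of 0.5 and the variable names are illustrative assumptions.

```r
set.seed(1)                         # arbitrary seed
n <- 200
x <- sort(runif(n, -2, 2))
y <- x^2 + rnorm(n, sd = 0.3)       # illustrative nonlinear relationship

# Local linear regression with the formula interface
fit_loess <- loess(y ~ x, span = 0.5, degree = 1)
y_loess <- predict(fit_loess)       # fitted values at the observed x

# lowess() takes x and y directly and returns a list with components x and y
fit_lowess <- lowess(x, y, f = 0.5)
```

Both fits can be added to a scatter plot with lines(x, y_loess) and lines(fit_lowess).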

Examples

Generalized additive semiparametric models (GAM)

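gam() (gam)
gam() (mgcv)

Both packages export a function called gam(). The package loaded last masks the other, so write gam::gam() or mgcv::gam() to be explicit.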

> N <- 10^3
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1
> y <- 1 + x1^2 + x2^3 + u
> 
> library(gam)
> g1 <- gam(y ~ x1 + x2 ) # Standard linear model
> par(mfrow=c(1,2))
> plot(g1, se = T)
> 
> g1 <- gam(y ~ s(x1) + x2 ) # x1 is locally estimated
> par(mfrow=c(1,2))
> plot(g1, se = T)
> 
> g1 <- gam(y ~ s(x1) + s(x2) ) # x1 and x2 are locally estimated
> par(mfrow=c(1,2))
> plot(g1, se = T)
> 
> library(mgcv)
> g1 <- gam(y ~ s(x1) + s(x2) ) # x1 and x2 are locally estimated
> par(mfrow=c(1,2))
> plot(g1, se = T)

References

↑ Jeffrey S. Racine Nonparametric Econometrics: A Primer http://socserv.mcmaster.ca/racine/ECO0301.pdf and at the R code examples http://socserv.mcmaster.ca/racine/primer_code.zip
↑ Qi Li, Jeffrey S. Racine, Nonparametric econometrics, Princeton University Press - 2007
↑ Wasserman, Larry, "All of nonparametric statistics", Springer (2007) (ISBN: 0387251456)

Linear Models

Standard linear model

In this section we present estimation functions for the standard linear model estimated by ordinary least squares (OLS). Heteroskedasticity and endogeneity are treated below. The main estimation function is lm().

Fake data simulations

We first generate a fake dataset such that there is no heteroskedasticity, no endogeneity and no correlation between the error terms. Therefore the ordinary least squares estimator is unbiased and efficient. We choose a model with two variables and take all the coefficients equal to one.

$y_{i}=1+x_{1,i}+x_{2,i}+u_{i}$

> N <- 1000
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- 1 + x1 + rnorm(N)
> y <- 1 + x1 + x2 + u
> df <- data.frame(y,x1,x2)

Least squares estimation

The standard function to estimate a simple linear model is lm().
lsfit() performs the least squares procedure but the output is not formatted in a convenient way.
ols() (Design) is another alternative.

We estimate the model using lm(). We store the results in fit and print the result using summary() which is the standard function.

> fit <- lm(y ~ x1 + x2, data = df)
> summary(fit)
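
As a cross-check, the coefficients reported by lm() can be reproduced from the normal equations. The following is a minimal sketch that regenerates data with the same design as above; the seed and the object names X and beta_hat are illustrative assumptions, not part of the original example.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
N  <- 1000
u  <- rnorm(N)
x1 <- rnorm(N)
x2 <- 1 + x1 + rnorm(N)
y  <- 1 + x1 + x2 + u
df <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = df)

# OLS estimator: solve (X'X) b = X'y
X        <- model.matrix(fit)    # design matrix, including the intercept column
beta_hat <- solve(crossprod(X), crossprod(X, df$y))

# Both approaches should agree up to numerical error
all.equal(as.vector(beta_hat), unname(coef(fit)))
```

lm() itself relies on a QR decomposition rather than on the normal equations, which is numerically more stable; the comparison is only meant to show what is being computed.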

There are some alternative ways to display the results.

display() in the arm package is one of them.
coefplot() (arm) graphs the estimated coefficients with confidence intervals. This is a good way to present the results.
mtable() in the memisc package can display the results of a set of regressions in the same table.

> library("arm")
> display(fit)
> coefplot(fit)

fit is a list of objects. You can see the list of these objects by typing names(fit). We can also apply functions to fit.

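We can get the estimated coefficients using fit$coeff (a partial match for fit$coefficients) or coef(fit). Both return the same vector; the outputs printed below were produced from different simulated samples, which is why the numbers differ between calls.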

> fit$coeff
(Intercept)          x1          x2 
  1.2026522   0.8427403   1.5146775
> coef(fit)
(Intercept)          x1          x2 
     0.7541      1.7844      0.7222 
> output <- summary(fit)
> coef(output) 
             Estimate Std. Error  t value    Pr(>|t|)
(Intercept) 1.1945847  0.2298888 5.196359 0.001258035
x1          0.6458170  0.3423214 1.886581 0.101182585
x2          0.6175165  0.2083628 2.963660 0.020995713

se.coef() (arm) returns the standard error of the estimated coefficients.

The vector of fitted values can be returned via fit$fitted, fitted(fit) or the predict() function. The predict() function also returns standard error and confidence intervals for predictions.

 
> fit$fitted
> fitted(fit)
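
The following sketch illustrates how predict() can return standard errors and intervals for new observations. The data frame new and its values are hypothetical, and the data are regenerated with an arbitrary seed.

```r
set.seed(2)
N  <- 200
x1 <- rnorm(N)
x2 <- 1 + x1 + rnorm(N)
y  <- 1 + x1 + x2 + rnorm(N)
fit <- lm(y ~ x1 + x2)

# Hypothetical new observations at which to predict
new <- data.frame(x1 = c(0, 1), x2 = c(1, 2))

# Standard errors of the predicted means
predict(fit, newdata = new, se.fit = TRUE)$se.fit

# Confidence interval for the mean response, and prediction interval for a new draw
ci <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
pi <- predict(fit, newdata = new, interval = "prediction", level = 0.95)
ci
pi
```

The prediction interval is wider than the confidence interval because it also accounts for the variance of the error term.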

The vector of residuals:

> fit$resid
> residuals(fit)

The residual degrees of freedom (fit$df is a partial match for fit$df.residual):

> fit$df

Confidence intervals

We can get the confidence intervals using confint() or conf.intervals() in the alr3 package.

> confint(fit, level = .9)
                   5 %     95 %
(Intercept) -0.7263261 1.200079
x1          -0.5724022 1.909924
x2           0.6185011 2.475079
> confint(fit, level = .95)
                 2.5 %   97.5 %
(Intercept) -0.9652970 1.439050
x1          -0.8803353 2.217858
x2           0.3881923 2.705388
> confint(fit, level = .99)
                 0.5 %   99.5 %
(Intercept) -1.5422587 2.016012
x1          -1.6237963 2.961319
x2          -0.1678559 3.261436
> library(alr3)
> conf.intervals(fit)
                 2.5 %   97.5 %
(Intercept) -0.9652970 1.439050
x1          -0.8803353 2.217858
x2           0.3881923 2.705388
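
The intervals returned by confint() for an lm object can be rebuilt by hand from the estimates, their standard errors and a Student t quantile. A minimal sketch on freshly simulated data (the seed and sample size are arbitrary):

```r
set.seed(3)
N  <- 50
x1 <- rnorm(N)
x2 <- 1 + x1 + rnorm(N)
y  <- 1 + x1 + x2 + rnorm(N)
fit <- lm(y ~ x1 + x2)

est   <- coef(fit)
se    <- sqrt(diag(vcov(fit)))              # standard errors
tcrit <- qt(0.975, df = df.residual(fit))   # two-sided 95% critical value

manual <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
manual
confint(fit, level = 0.95)
```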

Tests

coeftest() (lmtest) performs the Student t test and z test on coefficients.

> library("lmtest")
> coeftest(fit) # t-test
> coeftest(fit,df=Inf) # z-test (for large samples)

linearHypothesis() (car; called linear.hypothesis() in older versions of the package) performs a finite sample F test on a linear hypothesis or an asymptotic Wald test using the $\chi^{2}$ statistic.

> library("car")
> linearHypothesis(fit,"x1 = x2") # tests Beta1 = Beta2
> linearHypothesis(fit,c("(Intercept)", "x1","x2"),rep(1,3)) # Tests  Beta0 = Beta1 = Beta2 = 1
> linearHypothesis(fit,c("(Intercept)", "x1","x2"),rep(0,3)) # Tests  Beta0 = Beta1 = Beta2 = 0
> linearHypothesis(fit,c("x1","x2"),rep(0,2)) # Tests Beta1 = Beta2 = 0

See also waldtest() (lmtest) for nested models.
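
For nested models, base R's anova() already provides the F test. The sketch below tests Beta1 = Beta2 = 0 and recomputes the statistic from the residual sums of squares; the object names are illustrative and the data are simulated with an arbitrary seed.

```r
set.seed(4)
N  <- 500
x1 <- rnorm(N)
x2 <- rnorm(N) + x1
y  <- 1 + x1 + x2 + rnorm(N)

restricted   <- lm(y ~ 1)          # imposes Beta1 = Beta2 = 0
unrestricted <- lm(y ~ x1 + x2)
ftest <- anova(restricted, unrestricted)
ftest

# The same statistic from the residual sums of squares
rss0  <- sum(resid(restricted)^2)
rss1  <- sum(resid(unrestricted)^2)
q     <- 2                         # number of restrictions
Fstat <- ((rss0 - rss1) / q) / (rss1 / df.residual(unrestricted))
Fstat
```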

Analysis of variance

We can also make an analysis of variance using anova().

> anova(fit)

Model Search and information criteria

> # Akaike Information Criterion
> AIC(fit)
[1] 26.72857
> # Bayesian Information Criterion
> AIC(fit,k=log(N))
[1] 27.93891
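
Both criteria are simple functions of the maximized log-likelihood. The sketch below, on simulated data with an arbitrary seed, recomputes them from logLik(); the names aic_manual and bic_manual are illustrative.

```r
set.seed(5)
N <- 100
x <- rnorm(N)
y <- 1 + x + rnorm(N)
fit <- lm(y ~ x)

ll <- logLik(fit)
p  <- attr(ll, "df")      # number of estimated parameters, including the error variance

aic_manual <- -2 * as.numeric(ll) + 2 * p
bic_manual <- -2 * as.numeric(ll) + log(N) * p

c(manual = aic_manual, builtin = AIC(fit))
c(manual = bic_manual, builtin = AIC(fit, k = log(N)))
```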

The stats4 package provides AIC() and BIC() functions (recent versions of R also export both from the stats package):

> library(stats4)
> ?BIC
> lm1 <- lm(Fertility ~ . , data = swiss)
> AIC(lm1)
[1] 326.0716
> BIC(lm1)
[1] 339.0226

The step() function performs a model search using the Akaike Information Criterion.

> N <- 10^3
> u <- rnorm(N)
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1
> x3 <- rnorm(N)
> y <- 1+ x1 + x2 + u
> fit <- lm(y~x1+x2 + x3)
> step.fit <- step(fit)

Zelig

Least squares estimation is also supported in the Zelig package through the model = "ls" option.

> library(Zelig)
> N <- 1000
> u <- rnorm(N)
> x <- rnorm(N)
> y <- 1 + x + u
> mydat <- data.frame(y,x)
> z.out <- zelig(y ~  x, model = "ls", data = mydat)
> x.out <- setx(z.out, x = 10)
> s.out <- sim(z.out, x.out)
> summary(s.out)

Bayesian estimation

MCMCregress() (MCMCpack)
BLR() (BLR)

> library(MCMCpack)
> N <- 1000
> u <- rnorm(N)
> x <- rnorm(N)
> y <- 1 + x + u
> mydat <- data.frame(y,x)
> 
> posterior <- MCMCregress(y ~ x, data = mydat)
> summary(posterior)
> plot(posterior)

Heteroskedasticity

See the lmtest and sandwich packages.
gls() (nlme) computes the generalized least squares estimator.
See "Cluster-robust standard errors using R" (pdf) by Mahmood Arai. He suggests two functions for cluster-robust standard errors: clx() allows for one-way clustering and mclx() for two-way clustering. They can be loaded with the following command: source("http://people.su.se/~ma/clmclx.R").

> N <- 10 # 10 people
> T <- 5 # 5 times
> id <- rep(1:N,T)
> f <- rep(rnorm(N),T) # is individual specific
> u <- rnorm(N*T)
> x1 <- rnorm(N*T) 
> x2 <- rnorm(N*T) + x1
> y <- 1 + x1 + x2 + f + u
> fit <- lm(y ~ x1 + x2 )
> source("http://people.su.se/~ma/clmclx.R")
> clx(fit, 1, id)
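
To show the idea behind robust standard errors without external code, the following sketch computes the plain White (HC0) heteroskedasticity-robust covariance matrix by hand. This is not the clx() function and it does not handle clustering; the data-generating process and the object names are assumptions made for illustration.

```r
set.seed(6)
N <- 500
x <- rnorm(N)
u <- rnorm(N, sd = exp(0.5 * x))   # error variance depends on x
y <- 1 + x + u
fit <- lm(y ~ x)

X     <- model.matrix(fit)
e     <- resid(fit)
bread <- solve(crossprod(X))       # (X'X)^{-1}
meat  <- crossprod(X * e)          # X' diag(e^2) X: each row of X is scaled by its residual
vc_hc0 <- bread %*% meat %*% bread

cbind(classical = sqrt(diag(vcov(fit))),
      robust    = sqrt(diag(vc_hc0)))
```

The sandwich package mentioned above provides this estimator and its small-sample variants, and is the better choice in practice.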

Robustness

Cook's distance

> library(car)
> cookd(fit) # cooks.distance(fit) in the stats package computes the same statistic
           1            2            3            4            5 
0.0006205008 0.0643213760 0.2574810866 1.2128206779 0.2295047699 
           6            7            8            9           10 
0.3130578329 0.0003365221 0.0671830241 0.0048474954 0.0714255871
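
Cook's distance is also available in base R through cooks.distance(), and it can be rebuilt from the leverages and the standardized residuals. A minimal sketch, with one artificially contaminated observation (an assumption made for illustration):

```r
set.seed(7)
N <- 30
x <- rnorm(N)
y <- 1 + x + rnorm(N)
y[N] <- y[N] + 10                 # contaminate the last observation
fit <- lm(y ~ x)

h <- hatvalues(fit)               # leverage of each observation
r <- rstandard(fit)               # internally studentized residuals
p <- length(coef(fit))            # number of coefficients

# D_i = r_i^2 * h_i / (p * (1 - h_i))
cook_manual <- r^2 * h / (p * (1 - h))

all.equal(cook_manual, cooks.distance(fit))
which.max(cooks.distance(fit))    # index of the most influential observation
```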

Influence plot:

> influence.plot(fit)

Leverage plots:

> leverage.plot(fit,term.name=x1)
> leverage.plot(fit,term.name=x2)

Bonferroni's outlier test:

> outlier.test(fit)

max|rstudent| = 2.907674, degrees of freedom = 6,
unadjusted p = 0.02706231, Bonferroni p = 0.2706231

Observation: 3

See also outlier.t.test() in the alr3 package.

inf.index() in the alr3 package computes all the robustness statistics (Cook's distance, studentized residuals, outlier test, etc.).
rlm() in the MASS package performs a robust estimation.

See UCLA example
See also the robustbase package

Instrumental Variables

ivreg() in the AER package^[1]
tsls() in the sem package.
It is also possible to use the gmm() command in the gmm package. See Methods of moments for an example.

Fake data simulations

We first simulate a fake data set with x correlated to u, z and u independent and x correlated with z. Thus x is an endogenous explanatory variable of y and z is a valid instrument for x.

> N <- 1000
> z <- rnorm(N)
> u <- rnorm(N) 
> x <- 1 + z + u + rnorm(N) # x is correlated with the error term u (endogeneity) and the instrument z
> y <- 1 + x + u

Two stage least squares

Then we estimate the model with OLS (lm()) and IV using z as an instrument for x.

> ols <- lm(y ~ x)
> summary(ols) # OLS estimates are biased
> library("AER")
> iv <- ivreg(y ~ x | z)
> summary(iv) # IV estimates are consistent
> library("sem")
> iv2 <- tsls(y  ~ x, instruments = ~ z)
> summary(iv2)
> library("gmm")
> iv3 <- gmm(y ~ x, z)
> summary(iv3)

We plot the results :

> plot(y ~ x, col = "gray")
> abline(a  = 1,b = 1, lty = 1, col = 1, lwd = 2)
> abline(ols,  lty = 2, col = 2 , lwd = 2)
> abline(iv, lty = 3, col = 3, lwd = 2)
> legend("topleft", legend = c("True values","OLS","IV"), col = 1:3, lwd = rep(2,3), lty = 1:3)
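
The two stages can also be run by hand with lm(), which makes the mechanics explicit. The sketch below regenerates the data with an arbitrary seed; the names stage1, stage2 and beta_iv are illustrative. Note that the standard errors printed for the second-stage lm() fit are not valid, which is one reason to prefer ivreg() or tsls() in practice.

```r
set.seed(8)
N <- 1000
z <- rnorm(N)
u <- rnorm(N)
x <- 1 + z + u + rnorm(N)   # endogenous regressor
y <- 1 + x + u

# First stage: project the endogenous regressor on the instrument
stage1 <- lm(x ~ z)
x_hat  <- fitted(stage1)

# Second stage: regress y on the first-stage fitted values
stage2 <- lm(y ~ x_hat)

# With one regressor and one instrument the IV estimator is cov(z, y) / cov(z, x)
beta_iv <- cov(z, y) / cov(z, x)

c(ols  = unname(coef(lm(y ~ x))[2]),
  tsls = unname(coef(stage2)[2]),
  iv   = beta_iv)
```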

Panel Data

plm() (plm) implements the standard random effect, fixed effect, first differences methods^[2]. It is similar to Stata's xtreg command.

Note that plm output is not compatible with xtable() and mtable() for publication quality output.

lme4 and gee implement random effect and multilevel models.
See also BayesPanel

Random effects model

To implement a random effects model we generate a fake data set with 1000 individuals observed over 5 time periods.

> N <- 1000
> T <- 5
> library(mvtnorm)
> sig <- diag(rep(1,T))
> r <- rmvnorm(N, sigma = sig)
> wide <- data.frame(id = 1:N,f = rnorm(N), u = r)
> long <- reshape(wide, varying = list(3:7), v.names = "u", direction = "long", timevar = "year")
> long$x1 <- 1 + rnorm(N*T) 
> long$x2 <- 1 + rnorm(N*T) + long$x1
> long$y <- 1 + long$x1 + long$x2 + long$f + long$u
> head(long[order(long$id),])

We estimate the random effect model with the plm() function and the model = "random" option.

> library("plm")
> panel <- plm.data(long, index = c("id","year"))
> # panel <- pdata.frame(long,c("id","year"))
> eq <- y ~ x1 + x2
> re <- plm(eq, model = "random", data=panel)
> summary(re)

Fixed effects model

For a fixed effects model we generate a fake dataset and we correlate the fixed effects f with covariates :

> N <- 1000
> T <- 5
> library(mvtnorm)
> sig <- diag(rep(1,T))
> r <- rmvnorm(N, sigma = sig)
> wide <- data.frame(id = 1:N,f = rnorm(N), u = r)
> long <- reshape(wide, varying = list(3:7), v.names = "u", direction = "long", timevar = "year")
> long$x1 <- 1 + rnorm(N*T) + long$f
> long$x2 <- 1 + rnorm(N*T) + long$x1
> long$y <- 1 + long$x1 + long$x2 + long$f + long$u
> head(long[order(long$id),])

We first transform our data into a plm data frame using plm.data(). We estimate the fixed effects model using plm() with the model = "within" option. Then, we compare the estimates with those of the random effects model and perform a Hausman test. Finally, we plot the density of the fixed effects.

> library("plm")
> panel <- plm.data(long, index = c("id","year"))
> #panel <- pdata.frame(long,c("id","year"))
> eq <- y ~ x1 + x2
> fe <- plm(eq, model = "within", data=panel)
> summary(fe)
> re <- plm(eq, model = "random", data=panel)
> summary(re)
> phtest(fe, re)
> plot(density(fixef(fe)))
> rug(fixef(fe))
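
The within estimator can be reproduced in base R by subtracting individual means, or equivalently by including one dummy variable per individual. The sketch below uses a smaller simulated panel with a single covariate; the variable names (T_periods, x_w, y_w) are illustrative, and T_periods is used instead of T to avoid masking the shorthand for TRUE.

```r
set.seed(9)
N <- 200
T_periods <- 5
id <- rep(1:N, each = T_periods)
f  <- rep(rnorm(N), each = T_periods)    # individual effect
x  <- 1 + f + rnorm(N * T_periods)       # covariate correlated with the effect
y  <- 1 + x + f + rnorm(N * T_periods)

# Within transformation: subtract individual means
x_w <- x - ave(x, id)
y_w <- y - ave(y, id)
within <- lm(y_w ~ x_w - 1)

# Equivalent least-squares dummy-variable regression
lsdv <- lm(y ~ x + factor(id))

c(pooled = unname(coef(lm(y ~ x))[2]),
  within = unname(coef(within)),
  lsdv   = unname(coef(lsdv)["x"]))
```

The pooled OLS slope is biased upwards because x is correlated with the omitted individual effect, while the two fixed effects estimates coincide. The standard errors of the demeaned regression are not adjusted for the estimated means, so plm() remains the better tool for inference.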

Dynamic panel data

pgmm() (plm) implements the Arellano Bond estimation procedure^[3]. It is similar to xtabond2 in Stata^[4].

Simultaneous equations model

For a simultaneous equations model the following packages are needed:

sem package
systemfit package

References

↑ Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL http://CRAN.R-project.org/package=AER
↑ Yves Croissant, Giovanni Millo (2008). Panel Data Econometrics in R: The plm Package. Journal of Statistical Software 27(2). URL http://www.jstatsoft.org/v27/i02/.
↑ M Arellano, S Bond "Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations" - The Review of Economic Studies, 1991
↑ David Roodman, XTABOND2: Stata module to extend xtabond dynamic panel data estimator, http://ideas.repec.org/c/boc/bocode/s435901.html

External links

Visualization of regression coefficients

Quantile Regression

Quantile regression is an old method which has become popular only in recent years thanks to progress in computing. One of the main researchers in this area, Roger Koenker, is also an R practitioner and has developed a dedicated package for quantile regression (quantreg)^[1]^[2].

In theory, quantile regression models are also linear and could thus have been included in the Linear Models page. However, the topic is specific enough to deserve a page of its own.

Simple quantile model

We simulate from a simple quantile model. We first generate a uniform error term u and a covariate x.

N <- 10^3
u <- runif(N)
x <- 1 + rnorm(N)
y <- qnorm(u, mean = 0, sd = 2) + qnorm(u, mean = 1, sd = 1) * x

We estimate the quantile model for some values of tau (the quantile) and plot the coefficients :

library(quantreg)
q1 <- rq(y ~ x, tau = seq(.1,.9,.1))
summary(q1)
plot(q1)

We then plot the scatterplot, the predicted values using a standard linear model and the predicted values using a quantile linear model :

plot(x,y, col = "grey")
m1 <- lm(y ~ x)
abline(m1, col = "red")
taus <- seq(.1,.9,.1)
for (i in seq_along(taus)){
  abline(rq(y ~ x, tau = taus[i]), col = "blue")
}
grid()

We can also estimate the model for all quantiles at the same time :

q2 <- rq(y ~ x, tau = -1)
plot(q2, nrow = 2, ncol = 1)
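
Quantile regression minimizes the check (or pinball) loss instead of the sum of squared residuals. In the intercept-only case the minimizer is simply a sample quantile, which the following base R sketch verifies numerically. The function name check_loss and the chosen tau are illustrative assumptions.

```r
set.seed(10)
y   <- rnorm(501)
tau <- 0.25

# Check (pinball) loss: rho_tau(r) = r * (tau - 1{r < 0})
check_loss <- function(q, y, tau) {
  r <- y - q
  sum(r * (tau - (r < 0)))
}

# One-dimensional minimization over the range of the data
opt <- optimize(check_loss, interval = range(y), y = y, tau = tau)

c(minimizer       = opt$minimum,
  sample_quantile = unname(quantile(y, tau, type = 1)))
```

rq() solves the same kind of problem with covariates, using linear programming rather than a generic optimizer.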

Computing time

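For large data sets it is better to use the "fn" (Frisch-Newton interior point) or "pfn" (Frisch-Newton with preprocessing) method rather than the default "br" (Barrodale-Roberts simplex) method.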

> N <- 10^5
> u <- runif(N)
> x <- 1 + rnorm(N)
> y <- qnorm(u, mean = 0, sd = 2) + qnorm(u, mean = 1, sd = 1) * x
> system.time(rq(y ~ x, tau = .5, method = "br"))
   user  system elapsed 
   1.48    0.00    1.48 
> system.time(rq(y ~ x, tau = .5, method = "fn"))
   user  system elapsed 
   0.60    0.00    0.61 
> system.time(rq(y ~ x, tau = .5, method = "pfn")) 
   user  system elapsed 
   0.30    0.00    0.29

Resources

Koenker, Roger (2005) Quantile Regression, Cambridge University Press. ISBN 0-521-60827-9

References

↑ Roger Koenker (2010). quantreg: Quantile Regression. R package version 4.50. http://CRAN.R-project.org/package=quantreg
↑ Roger Koenker's personal webpage

Binomial Models

In this section, we look at the binomial model. We have one outcome which is binary and a set of explanatory variables.

This kind of model can be analyzed using a linear probability model. However a drawback of this model for the parameter of the Bernoulli distribution is that, unless restrictions are placed on $\beta$ , the estimated coefficients can imply probabilities outside the unit interval $[0,1]$ . For this reason, models such as the logit model or the probit model are more commonly used. If you want to estimate a linear probability model, have a look at the linear models page.

Logit model

The model takes the form: $y_{i}\sim \mathrm{Bernoulli}(\pi _{i})$ with the inverse link function: $\pi _{i}={\frac {\exp(x_{i}'\beta )}{1+\exp(x_{i}'\beta )}}$. It can be estimated using maximum likelihood or using Bayesian methods.

Fake data simulations

> x <- 1 + rnorm(1000,1) 
> xbeta <- -1  + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
> df <- data.frame(y,x)

Maximum likelihood estimation

The standard way to estimate a logit model is the glm() function with family binomial and link logit.
lrm() (Design, since superseded by the rms package) is another implementation of the logistic regression model.
There is an implementation in the Zelig package^[1].

In this example, we simulate a model with one continuous predictor and estimate this model using the glm() function.

> res <- glm(y ~ x , family  = binomial(link=logit))
> summary(res) # results
> confint(res) # confidence intervals
> names(res) 
> exp(res$coefficients) # odds ratios
> exp(confint(res)) # confidence intervals for the odds ratios (exponentiated interval endpoints)
> predict(res) # prediction on a linear scale
> predict(res, type = "response") # predicted probabilities
> plot(x, predict(res, type = "response")) # plot the predicted probabilities
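
To make the maximum likelihood step explicit, the sketch below writes the logit log-likelihood by hand and maximizes it with optim(), then compares the result with glm(). The function name negll, the starting values and the seed are illustrative assumptions.

```r
set.seed(11)
x <- 1 + rnorm(1000, 1)
xbeta <- -1 + x
y <- rbinom(1000, size = 1, prob = plogis(xbeta))

# Negative log-likelihood: -sum( y * eta - log(1 + exp(eta)) )
negll <- function(b) {
  eta <- b[1] + b[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

ml  <- optim(c(0, 0), negll, method = "BFGS")
res <- glm(y ~ x, family = binomial(link = "logit"))

rbind(optim = ml$par, glm = coef(res))
```

glm() uses iteratively reweighted least squares rather than a general-purpose optimizer, so the two sets of estimates agree only up to the optimizer's tolerance.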

Zelig

The Zelig package makes it easy to compute all the quantities of interest.

We develop a new example. First we simulate a new dataset with two continuous explanatory variables and we estimate the model using zelig() with the model = "logit" option.

We then look at the predicted values of y at the mean of x1 and x2.
Then we look at the predicted values when x1 = 0 and x2 = 0.
We also look at what happens when x1 changes from the 3rd to the 1st quartile.

> library(Zelig)
> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1  + x1 + x2
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)
> 
> z.out <- zelig(y ~ x1 + x2, model = "logit", data = mydat) # estimating the model
> summary(z.out)
> x.out <- setx(z.out, x1 = mean(x1), x2 = mean(x2)) # setting values for the explanatory variables
> s.out <- sim(z.out, x = x.out) # simulating the quantities of interest
> summary(s.out)
> plot(s.out) # plot the quantities of interest

> # the same with other values
> x.out <- setx(z.out, x1 = 0, x2 = 0)
> s.out <- sim(z.out, x = x.out)
> summary(s.out)

> # What happens if x1 changes from the 3rd quartile to the 1st quartile?
> x.high <- setx(z.out, x1 = quantile(mydat$x1,.75), x2 = mean(mydat$x2)) 
> x.low <- setx(z.out, x1 = quantile(mydat$x1,.25), x2 = mean(mydat$x2)) 
> s.out2<-sim(z.out, x=x.high, x1=x.low) 
> plot(s.out2)

ROC Curve in the verification package.
Zelig has a rocplot() function.

See UCLA Statistical Computing example

Bayesian estimation

bayesglm() in the arm package
MCMClogit() in the MCMCpack package for a Bayesian estimation of the logit model.

> # Data generating process
> x <- 1 + rnorm(1000,1) 
> xbeta <- -1  + (x* 1)
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> table(y)
> 
> library(MCMCpack)
> res <- MCMClogit(y ~ x)
> summary(res)

> library("arm")
> res <- bayesglm(y ~ x, family = binomial(link=logit))
> summary(res)

Probit model

The probit model is a binary model in which we assume that the inverse link function is the cumulative distribution function of the standard normal distribution.

We simulate fake data. First, we draw two random variables x1 and x2 from arbitrary distributions (the choice does not matter). Then we create the vector xbeta as a linear combination of x1 and x2. We apply the inverse link function to that vector and we draw the binary variable y as a Bernoulli random variable.

> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1  + x1 + x2
> proba <- pnorm(xbeta)
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)

Maximum likelihood

We can use the glm() function with family=binomial(link=probit) option or the probit() function in the sampleSelection package which is a wrapper of the former one.

> res <- glm(y ~ x1 + x2 , family = binomial(link=probit), data = mydat)
> summary(res)
> 
> library("sampleSelection")
> res2 <- probit(y ~ x1 + x2, data = mydat)
> summary(res2)
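
The fitted probabilities of a probit model are the standard normal CDF evaluated at the linear index. The sketch below checks this against fitted() and derives average marginal effects, a quantity not discussed above and included here only as an illustration; the seed and object names are assumptions.

```r
set.seed(12)
x1 <- 1 + rnorm(1000)
x2 <- -1 + x1 + rnorm(1000)
y  <- rbinom(1000, size = 1, prob = pnorm(-1 + x1 + x2))
res <- glm(y ~ x1 + x2, family = binomial(link = "probit"))

eta <- drop(model.matrix(res) %*% coef(res))   # linear index x'beta
p   <- pnorm(eta)                              # inverse link: standard normal CDF
all.equal(unname(p), unname(fitted(res)))

# Average marginal effect of each regressor: mean of dnorm(x'beta) times beta_j
ame <- mean(dnorm(eta)) * coef(res)[-1]
ame
```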

Bayesian estimation

MCMCprobit() (MCMCpack)

> library("MCMCpack")
> post <- MCMCprobit(y ~ x1 + x2 , data = mydat)
> summary(post)
> plot(post)

Semi-Parametric models

Klein and Spady estimator^[3] is implemented in the np package^[4] (see npindex() with method = "kleinspady" option).

References

↑ Kosuke Imai, Gary King, and Oliva Lau. 2008. "logit: Logistic Regression for Dichotomous Dependent Variables" in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software," http://gking.harvard.edu/zelig
↑ UCLA statistical computing probit example http://www.ats.ucla.edu/stat/R/dae/probit.htm
↑ Klein, R. W. and R. H. Spady (1993), “An efficient semiparametric estimator for binary response models,” Econometrica, 61, 387-421.
↑ Tristen Hayfield and Jeffrey S. Racine (2008). Nonparametric Econometrics: The np Package. Journal of Statistical Software 27(5). URL http://www.jstatsoft.org/v27/i05/.

Multinomial Models

Multinomial Logit

mlogit package
mnlogit package
bayesm package
multinom() (nnet)
multinomial(), a family function used with vglm() (VGAM)

Conditional Logit

clogit() in the survival package
mclogit package.

Multinomial Probit

mprobit package ^[1]
MNP package to fit a multinomial probit.

Multinomial ordered logit model

We consider a multinomial ordered logit model with unknown thresholds. First, we simulate fake data. We draw the residuals from a logistic distribution. Then we draw an explanatory variable x and define the latent variable ys as a linear function of x. Note that we set the constant to 0 because the constant and the thresholds cannot be identified simultaneously in this model, so one of the parameters has to be fixed. Then, we define the thresholds (-1,0,1) and construct the observed variable y using the cut() function. So y is an ordered multinomial variable.

N <- 10000
u <- rlogis(N)
x <- rnorm(N)
ys <- x + u
mu <- c(-Inf,-1,0,1, Inf)
y <- cut(ys, mu)
plot(y,ys)
df <- data.frame(y,x)

Maximum likelihood estimation

This model can be estimated by maximum likelihood using the polr() function in the MASS package. Since the constant and the thresholds cannot be identified simultaneously, polr() fits the model without a constant and estimates the thresholds instead.

library(MASS)
fit <- polr(y  ~ x, method = "logistic", data = df)
summary(fit)
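
The estimated thresholds (stored in fit$zeta) and the slope can be combined to obtain the probability of each category. The sketch below does this by hand for one hypothetical covariate value x0 and compares the result with predict(); it assumes the parameterization P(y <= k | x) = plogis(zeta_k - x * beta) used by polr().

```r
library(MASS)
set.seed(13)
N  <- 2000
x  <- rnorm(N)
ys <- x + rlogis(N)
y  <- cut(ys, c(-Inf, -1, 0, 1, Inf), ordered_result = TRUE)
df <- data.frame(y, x)
fit <- polr(y ~ x, method = "logistic", data = df)

# Cumulative probabilities at x0, then differences give the category probabilities
x0  <- 0.5                                   # hypothetical covariate value
cum <- c(plogis(fit$zeta - x0 * coef(fit)), 1)
probs_manual <- diff(c(0, cum))

probs_manual
probs_pred <- predict(fit, newdata = data.frame(x = x0), type = "probs")
probs_pred
```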

Bayesian estimation

bayespolr() (arm) performs a Bayesian estimation of the multinomial ordered logit.

library("arm")
fit <- bayespolr(y ~ x, method = "logistic", data = df)
summary(fit)

Multinomial ordered probit model

We generate fake data by drawing an error term from a normal distribution and cutting the latent variable into 4 categories.

N <- 1000
u <- rnorm(N)
x <- rnorm(N)
ys <- x + u
mu <- c(-Inf,-1,0,1, Inf)
y <- cut(ys, mu)
plot(y,ys)
df <- data.frame(x,y)

Maximum likelihood estimation

The model can be fitted using maximum likelihood method. This can be done using the polr() function in the MASS package with the probit method.

library(MASS)
fit <- polr(y  ~ x, method = "probit", data = df)
summary(fit)

Bayesian estimation

bayespolr() (arm) performs a Bayesian estimation of the multinomial ordered probit.

Rank Ordered Logit Model

This model was introduced in econometrics by Beggs, Cardell and Hausman in 1981.^[2]^[3] One application is the paper by Combes et al. explaining the ranking of candidates to become professor.^[3] It is also known as the Plackett–Luce model in the biomedical literature or as the exploded logit model in marketing.^[3]

Conditionally Ordered Hierarchical Probit

The Conditionally Ordered Hierarchical Probit can be estimated using the anchors package developed by Gary King and his coauthors^[4].

References

↑ Harry Joe, Laing Wei Chou and Hongbin Zhang (2006). mprobit: Multivariate probit model for binary/ordinal response. R package version 0.9-2.
↑ Beggs, S; Cardell, S; Hausman, J (1981). "Assessing the potential demand for electric cars". Journal of Econometrics. 17: 1–19. doi:10.1016/0304-4076(81)90056-7.
↑ ^a ^b ^c Combes, Pierre-Philippe; Linnemer, Laurent; Visser, Michael (2008). "Publish or peer-rich? The role of skills and networks in hiring economics professors". Labour Economics. 15 (3): 423–41. doi:10.1016/j.labeco.2007.04.003.
↑ Jonathan Wand, Gary King, Olivia Lau (2009). anchors: Software for Anchoring Vignette Data. Journal of Statistical Software, Forthcoming. URL http://www.jstatsoft.org/.

Tobit And Selection Models

Tobit (type 1 Tobit)

In this section, we look at a simple tobit model where the outcome variable is observed only if it is above or below a given threshold.

The model can be estimated with tobit() in the AER package^[1]. This is a wrapper for survreg() in the survival package.

N <- 1000
u <- rnorm(N)
x <- - 1 + rnorm(N)
ystar <- 1 + x + u
y <- ystar*(ystar > 0)
hist(y)

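ols <- lm(y ~ x)
summary(ols)

# Plot a correlation matrix and a scatter-plot matrix
library(GGally)
library(ggplot2)
library(ggfortify)
DATA <- data.frame(y, x) # data frame holding the simulated variables
ggcorr(DATA)
ggpairs(DATA)

# Diagnostic plots for the linear model
M <- lm(y ~ ., data = DATA)
autoplot(M, label.size = 3)

# Tobit estimation (the result is not called tobit, to avoid masking the function)
library(AER)
fit.tobit <- tobit(y ~ x, left = 0, right = Inf, dist = "gaussian")
summary(fit.tobit)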
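
Since tobit() is described above as a wrapper for survreg(), the same model can be sketched directly with the survival package as a left-censored Gaussian regression. The data are regenerated with an arbitrary seed and the object name fit_sr is illustrative.

```r
library(survival)
set.seed(14)
N <- 1000
x <- -1 + rnorm(N)
ystar <- 1 + x + rnorm(N)
y <- pmax(ystar, 0)                 # outcome left-censored at 0

# The event indicator is TRUE when y is observed exactly, FALSE when it is censored at 0
fit_sr <- survreg(Surv(y, y > 0, type = "left") ~ x, dist = "gaussian")

rbind(ols = coef(lm(y ~ x)), tobit = coef(fit_sr))
fit_sr$scale                        # estimated standard deviation of the error term
```

OLS on the censored outcome is attenuated towards zero, while the censored regression recovers coefficients close to the true values of 1.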

Selection models (type 2 tobit or heckit)

In this section we look at an endogenous selection process. The outcome y is observed only if d is equal to one, where d is a binary variable which is correlated with the error term of y.

heckit() and selection() in sampleSelection ^[2]. The command is called heckit() in honor of James Heckman^[3].

N <- 1000
u <- rnorm(N)
v <- rnorm(N)
x <- - 1 + rnorm(N)
z <- 1 + rnorm(N)
d <- (1 + x + z + u + v> 0)
ystar <- 1 + x + u
y <- ystar*(d == 1)
hist(y)

ols <- lm(y ~ x)
summary(ols)

library(sampleSelection)
heckit.ml <- heckit(selection = d ~ x + z, outcome = y ~ x, method = "ml")
summary(heckit.ml)

heckit.2step <- heckit(selection = d ~ x + z, outcome = y ~ x, method = "2step")
summary(heckit.2step)
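
The two-step procedure can be sketched in base R: a probit for selection, followed by OLS on the selected sample with the inverse Mills ratio added as a regressor. This is a simplified illustration with a larger simulated sample and an arbitrary seed; the standard errors of the second step are not corrected here, which heckit() takes care of.

```r
set.seed(15)
N <- 5000
u <- rnorm(N)
v <- rnorm(N)
x <- -1 + rnorm(N)
z <- 1 + rnorm(N)
d <- (1 + x + z + u + v > 0)
y <- ifelse(d, 1 + x + u, NA)       # outcome observed only when d is TRUE

# Step 1: probit for selection, then the inverse Mills ratio
sel <- glm(d ~ x + z, family = binomial(link = "probit"))
idx <- predict(sel, type = "link")
imr <- dnorm(idx) / pnorm(idx)

# Step 2: OLS on the selected sample, adding the inverse Mills ratio
step2 <- lm(y ~ x + imr, subset = d)

rbind(naive   = coef(lm(y ~ x, subset = d)),
      twostep = coef(step2)[1:2])
```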

Multi-index selection models

In this section we look at endogenous selection processes in matching markets. Matching is concerned with who transacts with whom, and how. For example, which students attend which college. The outcome y is observed only for equilibrium student-college pairs (or matches). These matches are indicated with d equal to one with d a binary variable which is correlated with the error term of y.

stabit() and stabit2() in matchingMarkets.^[4]^[5] The command is called stabit() in reference to the application in stable matching markets.

Simulate two-sided matching data for 20 markets (m=20) with 100 students (nStudents=100) per market and 20 colleges with quotas of 5 students, each (nSlots=rep(5,20)). True parameters in selection and outcome equations are all equal to 1.

library(matchingMarkets)
xdata <- stabsim2(m=20, nStudents=100, nSlots=rep(5,20),
  colleges = "c1",
  students = "s1",
  outcome = ~ c1:s1 + eta + nu,
  selection = ~ -1 + c1:s1 + eta
)

Observe the bias from sorting between students and colleges.

lm1 <- lm(y ~ c1:s1, data=xdata$OUT)
summary(lm1)

Correct for sorting bias by running the Gibbs sampler in Sorensen (2007).^[6]

fit2 <- stabit2(OUT = xdata$OUT,
           colleges = "c1",
           students = "s1",
           outcome = y ~ c1:s1, 
           selection = ~ -1 + c1:s1,
           niter=1000
)
summary(fit2)

Truncation

truncreg package
DTDA "An R package for analyzing truncated data" pdf.

References

↑ Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL http://CRAN.R-project.org/package=AER
↑ Sample Selection Models in R: Package sampleSelection http://www.jstatsoft.org/v27/i07
↑ James Heckman "Sample selection bias as a specification error", Econometrica: Journal of the econometric society, 1979
↑ Klein, T. (2015). "Analysis of Stable Matchings in R: Package matchingMarkets" (PDF). Vignette to R Package matchingMarkets.
↑ "matchingMarkets: Analysis of Stable Matchings". R Project.
↑ Sorensen, M. (2007). "How Smart is Smart Money? A Two-Sided Matching Model of Venture Capital". Journal of Finance. 62 (6): 2725–2762.

Count Data Models

The Poisson model

Fake data simulations

We assume that y follows a Poisson distribution with mean exp(1 + 1 * x). We store the data in the "df" data frame.

N <- 1000
x <- rnorm(N)
alpha <- c(1,1)
y <- rpois(N,exp(alpha[1] + alpha[2] * x))
df <- data.frame(x,y)
plot(x,y)

Maximum likelihood

We estimate this simple model using the glm() function with family = poisson as option.

fit <- glm(y ~ x, family = poisson, data = df)
summary(fit)

Bayesian estimation

The model can also be estimated using Bayesian methods with the MCMCpoisson() function, which is provided in the MCMCpack package.

library("MCMCpack")
posterior <- MCMCpoisson(y ~ x, data = df)
plot(posterior)
summary(posterior)

Overdispersion test

dispersiontest() (AER package) provides a test for equidispersion.
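
A rough check for overdispersion that needs no extra package is the Pearson dispersion statistic, which should be close to 1 under the Poisson assumption. The sketch below is not the test implemented by dispersiontest(); the helper name dispersion and the negative binomial comparison data are assumptions made for illustration.

```r
set.seed(16)
N  <- 1000
x  <- rnorm(N)
mu <- exp(1 + x)
y_pois <- rpois(N, mu)                    # equidispersed counts
y_nb   <- rnbinom(N, mu = mu, size = 2)   # overdispersed counts (variance mu + mu^2 / 2)

# Pearson chi-square statistic divided by the residual degrees of freedom
dispersion <- function(y) {
  fit <- glm(y ~ x, family = poisson)
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

c(poisson = dispersion(y_pois), negbin = dispersion(y_nb))
```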

Zero inflated model

See the zic package^[1]

Bivariate poisson regression

bivpois package for bivariate poisson regression.

References

See UCLA website for an example
Zeileis, A., Kleiber, C. and Jackman, S. Regression Models for Count Data in R
Replication files for Cameron and Trivedi's 1998 book^[2] are provided in the AER package^[3]. You can simply type ?CameronTrivedi1998 and you will find the source code.

↑ Markus Jochmann (2010). zic: Bayesian Inference for Zero-Inflated Count Models. R package version 0.5-3. http://CRAN.R-project.org/package=zic
↑ Cameron, A.C. and Trivedi, P.K. (1998). Regression Analysis of Count Data. Cambridge: Cambridge University Press.
↑ Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL http://CRAN.R-project.org/package=AER

Duration Analysis

Using R for Survival Analysis (pdf)
See the survival package
bootkm() Bootstrap Kaplan-Meier Estimates in Hmisc package
event.chart() Flexible Event Chart for Time-to-Event Data in the Hmisc package

Time Series

Introduction

In the following examples we will use the data set Mpyr, which is included in the R package Ecdat. It can be loaded and displayed with the following code.

# Install the package Ecdat.
install.packages("Ecdat")
# Load the package Ecdat.
library(Ecdat)
# Load the dataset Mpyr.
data(Mpyr)
# Print the dataset Mpyr.
Mpyr
Time Series:
Start = 1900 
End = 1989 
Frequency = 1 
            m        p         y         r
1900 1.718774 2.092641 0.9030195  4.380000
1901 1.856318 2.086574 1.0131038  4.280000
1902 1.936512 2.120476 1.0114817  4.920000

Creating time-series objects

The function ts() is used to create time-series objects.
The function as.ts() coerces an object to a time-series.
The function is.ts() tests whether an object is a time-series.

Example:

> data.a<-seq(1,24,by=1)
> is.ts(data.a)
[1] FALSE
> ts(data.a, start=c(2005,1), frequency=12) 
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2005   1   2   3   4   5   6   7   8   9  10  11  12
2006  13  14  15  16  17  18  19  20  21  22  23  24
> data.b<-seq(1,24,by=1)
> is.ts(data.b)
[1] FALSE
> is.ts(as.ts(data.b))
[1] TRUE

Creating lagged and differenced variables

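The function lag() shifts the time base of a series. With a positive k the series starts k periods earlier, so at date t the result holds the value originally observed at t + k; use a negative k to obtain past values aligned with the current date.
The function diff() creates a differenced variable.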

Example:

> data.a<-seq(1,12,by=1)
> ts.a<-ts(data.a, start=c(2005,1), frequency=4)
> lag.a<-lag(ts.a,k=1)
> diff.a<-diff(ts.a,lag=1,differences=1)
> ts.a
     Qtr1 Qtr2 Qtr3 Qtr4
2005    1    2    3    4
2006    5    6    7    8
2007    9   10   11   12
> lag.a
     Qtr1 Qtr2 Qtr3 Qtr4
2004                   1
2005    2    3    4    5
2006    6    7    8    9
2007   10   11   12     
> diff.a
     Qtr1 Qtr2 Qtr3 Qtr4
2005         1    1    1
2006    1    1    1    1
2007    1    1    1    1
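
To line up a series with its own past values, a negative k is needed. The following sketch binds the series, its first lag and its first difference on a common time index; the column names are illustrative.

```r
ts.a <- ts(1:12, start = c(2005, 1), frequency = 4)

# stats::lag(ts.a, k = -1) moves the series forward in time,
# so that at date t it holds the value observed at t - 1
aligned <- cbind(x = ts.a, x_lag1 = stats::lag(ts.a, k = -1), dx = diff(ts.a))
aligned

# Keep only the dates where all three columns are observed
complete <- window(aligned, start = c(2005, 2), end = c(2007, 4))
complete
```

Calling stats::lag() explicitly avoids picking up a lag() function from another attached package.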

Plotting time-series objects

The function plot.ts() is used for plotting time-series objects.

Fit Autoregressive Models to Time-series by OLS

In order to fit an autoregressive time series model to the data by ordinary least squares it is possible to use the function ar.ols() which is part of the "stats" package.
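
As a small illustration, the sketch below simulates an AR(1) process with arima.sim() and fits it with ar.ols(). The coefficient 0.7, the sample size and the seed are arbitrary choices.

```r
set.seed(17)
x <- arima.sim(model = list(ar = 0.7), n = 500)   # simulated AR(1) series

# Fix the order at 1 instead of selecting it by AIC
fit_ar <- ar.ols(x, order.max = 1, aic = FALSE, demean = TRUE, intercept = FALSE)

fit_ar$ar          # estimated autoregressive coefficient
fit_ar$var.pred    # estimated innovation variance
```

With aic = TRUE (the default) the order is chosen automatically up to order.max.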

Autocorrelation function

The function acf() computes (and by default plots) estimates of the autocovariance or autocorrelation function. Function pacf() is the function used for the partial autocorrelations. Function ccf() computes the cross-correlation or cross-covariance of two univariate series.^[1]
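
The estimates can also be retrieved without plotting by setting plot = FALSE. A minimal sketch on a simulated AR(1) series (coefficient and seed chosen arbitrarily):

```r
set.seed(18)
x <- arima.sim(model = list(ar = 0.7), n = 500)

a <- acf(x, lag.max = 5, plot = FALSE)    # autocorrelations at lags 0 to 5
p <- pacf(x, lag.max = 5, plot = FALSE)   # partial autocorrelations at lags 1 to 5

round(drop(a$acf), 2)
round(drop(p$acf), 2)
```

For an AR(1) process the autocorrelations decay geometrically, while the partial autocorrelations are close to zero beyond the first lag.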

Useful R-packages

fBasics, tis, zoo, tseries, xts, urca, forecast

References

↑ http://www.inside-r.org/r-doc/stats/acf

http://cran.r-project.org/web/views/TimeSeries.html

http://cran.r-project.org/doc/contrib/Ricci-refcard-ts.pdf

Factor Analysis

Introduction

Factor analysis is a set of techniques to reduce the dimensionality of the data. The goal is to describe the dataset with a smaller number of variables (i.e. underlying factors). Factor Analysis was developed in the early part of the 20th century by L.L. Thurstone and others. Correspondence analysis was originally developed by Jean-Paul Benzécri in the 1960s and 1970s. Factor analysis is mainly used in marketing, sociology and psychology. It is often grouped with data mining, multivariate data analysis or exploratory data analysis.

There are three main methods. Principal Component Analysis deals with continuous variables. Correspondence Analysis deals with a contingency table (two qualitative variables) and Multiple correspondence analysis is a generalization of the correspondence analysis with more than two qualitative variables. The major difference between Factor Analysis and Principal Components Analysis is that in FA, only the variance which is common to multiple variables is analysed, while in PCA, all of the variance is analysed. Factor Analysis is a difficult procedure to use properly, and is often misapplied in the psychological literature. One of the major issues in FA (and PCA) is the number of factors to extract from the data. Incorrect numbers of factors can cause difficulties with the interpretation and analysis of the data.

There are a number of techniques which can be applied to assess how many factors to extract. The two most useful are parallel analysis and the minimum average partial criterion. Parallel analysis works by simulating a random data matrix with the same dimensions as the data and extracting eigenvalues from the simulated data set. The point at which the simulated eigenvalues are greater than those of the data is the point at which the "correct" number of factors has been extracted. The Minimum Average Partial criterion uses a different approach but can often be more accurate. Simulation studies have established these two methods as the most accurate. Both of these methods are available in the psych package through the fa.parallel() and VSS() functions.
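
The idea of parallel analysis can be sketched in a few lines of base R. This is a simplified version based on the eigenvalues of the correlation matrix and the 95% quantile of the simulated eigenvalues; it is not the algorithm used by fa.parallel(), and the simulated two-factor data, the number of replications and the object names are assumptions.

```r
set.seed(19)
N  <- 500
f1 <- rnorm(N)
f2 <- rnorm(N)
dat <- data.frame(x1 = f1 + rnorm(N), x2 = f1 + rnorm(N),
                  x3 = f2 + rnorm(N), x4 = f2 + rnorm(N))

# Eigenvalues of the observed correlation matrix
obs <- eigen(cor(dat))$values

# Eigenvalues of correlation matrices of pure-noise data with the same dimensions
sim <- replicate(200, eigen(cor(matrix(rnorm(N * ncol(dat)), N)))$values)
ref <- apply(sim, 1, quantile, probs = 0.95)

# Count the leading observed eigenvalues that exceed the reference values
n_factors <- sum(cumprod(obs > ref))
rbind(observed = obs, reference = ref)
n_factors
```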

Another issue in factor analysis is which rotation (if any) to choose. Essentially, rotations transform the factor loadings so that they are more easily interpretable. There are two major classes of rotations, orthogonal and oblique. Orthogonal rotations assume that the factors are uncorrelated, while oblique rotations allow the factors to correlate (but do not force them to). Oblique rotations are recommended by some (e.g. MacCallum et al. 1999) because an orthogonal solution can be obtained from an oblique rotation, but not vice versa.
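As a small illustration of the two classes, factanal() in the stats package accepts rotation = "varimax" (orthogonal) and rotation = "promax" (oblique). The simulated data below are an assumption made for this sketch; six variables are used because a two-factor maximum-likelihood model needs more than four.

```r
set.seed(1)   # arbitrary seed, used only for reproducibility

# Six observed variables driven by two underlying factors (illustrative)
N  <- 500
f1 <- rnorm(N)
f2 <- rnorm(N)
dat <- data.frame(x1 = f1 + rnorm(N), x2 = f1 + rnorm(N), x3 = f1 + rnorm(N),
                  x4 = f2 + rnorm(N), x5 = f2 + rnorm(N), x6 = f2 + rnorm(N))

fa_varimax <- factanal(dat, factors = 2, rotation = "varimax")  # orthogonal
fa_promax  <- factanal(dat, factors = 2, rotation = "promax")   # oblique

# Hide small loadings to make the pattern easier to read
print(fa_varimax$loadings, cutoff = 0.3)
print(fa_promax$loadings, cutoff = 0.3)

# Rotation changes the loadings but not the fit: the uniquenesses agree
all.equal(fa_varimax$uniquenesses, fa_promax$uniquenesses)
```

The last line shows why rotation alone cannot decide between solutions: both rotated models reproduce the data equally well.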

One of the issues surrounding factor analysis is that there is an infinite number of rotations which explain the same amount of variance, so it can be difficult to assess which model is correct. In response to such concerns, Structural Equation Modelling (SEM), of which Confirmatory Factor Analysis (CFA) is a special case, was developed by Jöreskog in the 1970s. The essential principle of SEM is that, given a model, it attempts to reproduce the covariance matrix observed in the data. The ability of a model to reproduce the data can be used as a test of that model's adequacy. SEM is implemented in R in the sem and lavaan packages, as well as in the OpenMx package (which was originally distributed outside CRAN).

See the following packages: FactoMineR, amap, ade4, anacor, vegan, psych.

Principal Component Analysis (PCA)

PCA deals with continuous variables

prcomp() in the stats package.
princomp() in the stats package.
PCA() (FactoMineR)
See also factanal() in the stats package.
See also fa() and principal() in the psych package.

N <- 1000
factor1 <- rnorm(N)
factor2 <- rnorm(N) 
x1 <- rnorm(N) + factor1
x2 <- rnorm(N) + factor1
x3 <- rnorm(N) + factor2 
x4 <- rnorm(N) + factor2
mydat <- data.frame(x1,x2,x3,x4)
pca <- prcomp(mydat)
names(pca)
plot(pca) # plot the eigenvalues
biplot(pca) # A two dimensional plot

pca2 <- princomp(mydat)
biplot(pca2)

pca2 <- princomp(~ x1 + x2 + x3 + x4, data = mydat) # princomp with a formula syntax
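A common next step is to look at how much variance each component explains. The sketch below rebuilds the simulated data so that it runs on its own; the seed and the choice to standardise the variables (scale. = TRUE) are assumptions made for this example.

```r
set.seed(123)   # arbitrary seed, used only for reproducibility

N <- 1000
factor1 <- rnorm(N)
factor2 <- rnorm(N)
mydat <- data.frame(x1 = rnorm(N) + factor1, x2 = rnorm(N) + factor1,
                    x3 = rnorm(N) + factor2, x4 = rnorm(N) + factor2)

pca <- prcomp(mydat, scale. = TRUE)   # standardise each variable first

eigenvalues <- pca$sdev^2                     # variance of each component
prop_var    <- eigenvalues / sum(eigenvalues) # share of total variance
round(cumsum(prop_var), 3)                    # cumulative share

# With standardised variables, these are the eigenvalues
# of the correlation matrix
all.equal(eigenvalues, eigen(cor(mydat))$values)

head(pca$x[, 1:2])   # scores of the first observations on PC1 and PC2
```

summary(pca) reports the same proportions in its importance table.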

Correspondence Analysis (CA)

Correspondence analysis is a tool for analyzing contingency tables.

corresp() in the MASS package
Michael Greenacre's ca package (JSS article)
Correspondence Analysis and Related Network
Quick-R's page
Simple and Canonical Correspondence Analysis Using the R Package anacor (JSS article)
multiv
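To show what these functions compute, here is a minimal sketch of simple correspondence analysis done by hand with a singular value decomposition in base R. It is not the code used by corresp() or the ca package; the choice of the built-in HairEyeColor data and all variable names are assumptions made for this example.

```r
# Hair colour by eye colour, summed over sex (built-in data set)
tab <- unclass(margin.table(HairEyeColor, c(1, 2)))

P  <- tab / sum(tab)   # correspondence matrix (relative frequencies)
r  <- rowSums(P)       # row masses
cm <- colSums(P)       # column masses

# Standardised residuals from the independence model
S <- diag(1 / sqrt(r)) %*% (P - outer(r, cm)) %*% diag(1 / sqrt(cm))

sv <- svd(S)
inertia <- sv$d^2                   # principal inertias
round(inertia / sum(inertia), 3)    # share of inertia per axis

# Principal coordinates of rows and columns on the first two axes
row_coord <- diag(1 / sqrt(r))  %*% sv$u[, 1:2] %*% diag(sv$d[1:2])
col_coord <- diag(1 / sqrt(cm)) %*% sv$v[, 1:2] %*% diag(sv$d[1:2])
rownames(row_coord) <- rownames(tab)
rownames(col_coord) <- colnames(tab)

# The total inertia equals the chi-square statistic divided by n
all.equal(sum(inertia), unname(chisq.test(tab)$statistic) / sum(tab))
```

Plotting row_coord and col_coord on the same axes gives the usual correspondence analysis map.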

Multiple Correspondence Analysis (MCA)

Wikipedia has related information at Multiple correspondence analysis

Ordination

Overview

This page provides basic code for creating a distance matrix and running and plotting a Non-metric Multidimensional Scaling (NMDS) ordination.

Read more about Ordination on Wikipedia.

This code relies on the vegan package by Jari Oksanen.

Data

First, load the required libraries and import the data:

require(MASS)
require(vegan)
data(varespec)   # species data
data(varechem)   # environmental data

Distance matrix

bray <- vegdist(varespec, method = "bray")				# calculate a distance matrix

# There are many distance measure options for 'vegdist', 
# discoverable by running '?vegdist'. Common distance measures include:
       # 'bray' = Bray-Curtis
       # 'canberra' = Canberra
       # 'euclidean' = Euclidean
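To make the Bray-Curtis measure concrete, it can be computed by hand in base R as the sum of absolute differences divided by the sum of all abundances. The function name bray_curtis and the three hypothetical sites below are assumptions made for this sketch.

```r
# Bray-Curtis dissimilarity between two abundance vectors
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Hypothetical abundance counts for three species at three sites
sites <- rbind(siteA = c(10, 0, 5),
               siteB = c(8, 2, 5),
               siteC = c(0, 12, 1))

bray_curtis(sites["siteA", ], sites["siteB", ])   # 4 / 30

# Full pairwise dissimilarity matrix
n <- nrow(sites)
d <- matrix(0, n, n, dimnames = list(rownames(sites), rownames(sites)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- bray_curtis(sites[i, ], sites[j, ])
  }
}
as.dist(d)
```

Sites that share no species get a dissimilarity of 1, and identical sites get 0.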

Unconstrained Ordination

Displaying dissimilarity using NMDS

NMDS analysis and plotting:

nmds <- metaMDS(varespec, k = 2, 
          distance = 'bray', autotransform = FALSE) 	# semi-black box NMDS function

ordiplot(nmds, type = "text")			      # Plot NMDS ordination
fit <- envfit(nmds, varechem[ ,1:4])			   # Calculates environmental vectors
fit						        # Lists vector endpoint coordinates and r-squared values
plot(fit)						   # adds environmental vectors
# a linear representation of environmental variables is not always appropriate
# we could also add a smooth surface of the variable to the plot
ordisurf(nmds, varechem$N, add = TRUE, col = "darkgreen")
nmds$stress                                             # stress value

In the metaMDS function, k is the user-defined number of dimensions of the ordination. The stress value measures how well the distances in the k-dimensional configuration reproduce the original dissimilarities. Conventional wisdom seems to suggest that stress should not exceed 10-12% (0.10 to 0.12 when stress is reported as a proportion, as recent versions of vegan appear to do). Stress is reduced by increasing the number of dimensions. However, increasing dimensionality might decrease the "realism" of a 2-dimensional plot of the first two NMDS axes.
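The trade-off between stress and the number of dimensions can be explored without vegan. The sketch below swaps in isoMDS() from the MASS package (loaded above) and the built-in eurodist distances, both of which are assumptions made for this example rather than part of the varespec analysis.

```r
library(MASS)   # provides isoMDS(); assumed to be installed

d <- eurodist   # road distances between 21 European cities (built-in)

# Fit a non-metric MDS for k = 2, 3 and 4 dimensions and collect the stress
ks <- 2:4
stress_by_k <- sapply(ks, function(k) isoMDS(d, k = k, trace = FALSE)$stress)
names(stress_by_k) <- paste0("k=", ks)

round(stress_by_k, 2)   # isoMDS() reports stress in percent
```

Stress is expected to fall as k grows, which is why a low stress value on its own says little unless the number of dimensions is reported with it.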

We can also run an NMDS with 3 dimensions, fit environmental vectors and create a dynamic graph:

nmds3d <- metaMDS(varespec, k = 3, 
  distance = 'bray', autotransform = FALSE)              # run nmds with 3 dimensions
nmds3d$stress                                            # stress drops
fit3d <- envfit(nmds3d, varechem[ ,1:4], choices = 1:3)  # fit environmental vectors to 3d space
ordirgl(nmds3d, envfit = fit3d)                          # dynamic 3D graph (in recent releases ordirgl() is provided by the vegan3d package)

Running a principal component analysis (PCA) on environmental data

chem_pca <- rda(varechem, scale = TRUE)    # Run PCA
biplot(chem_pca, scaling = 2)              # display biplot

Constrained Ordination

Clustering

Basic clustering

K-Means Clustering

You can use the kmeans() function.

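First create some data (100 observations of 10 variables):

> dat <- matrix(rnorm(1000), nrow=100, ncol=10)

To apply kmeans(), you need to specify the number of clusters. Because the data and the starting centers are random, the cluster sizes will differ from run to run: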

> cl <- kmeans(dat, 3) # here 3 is the number of clusters
> table(cl$cluster)
 1  2  3 
38 44 18
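Purely random data have no real cluster structure, so the following sketch uses two simulated, well-separated groups to show what kmeans() returns. The seed, the group means and the use of nstart = 10 are assumptions made for this example.

```r
set.seed(2024)   # arbitrary seed, used only for reproducibility

# Two well-separated groups of 50 points each (simulated, illustrative)
grp <- rep(1:2, each = 50)
pts <- cbind(x = rnorm(100, mean = ifelse(grp == 1, 0, 10)),
             y = rnorm(100, mean = ifelse(grp == 1, 0, 10)))

# nstart = 10 tries ten random sets of starting centers and keeps the best
cl <- kmeans(pts, centers = 2, nstart = 10)

table(truth = grp, cluster = cl$cluster)   # compare with the known groups
cl$centers                                 # estimated cluster centers
cl$tot.withinss                            # total within-cluster sum of squares
```

The cluster labels themselves are arbitrary: cluster 1 may correspond to either group, which is why the cross-tabulation is more informative than the raw labels.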

Hierarchical Clustering

The basic hierarchical clustering function is hclust(), which works on a dissimilarity structure as produced by the dist() function:

> hc <- hclust(dist(dat)) # data matrix from the example above
> plot(hc)

The resulting tree can be cut using the cutree() function.

Cutting it at a given height:

> cl <- cutree(hc, h=5.1)
> table(cl)
cl
 1  2  3  4  5 
23 33 29  4 11

Cutting it to obtain a given number of clusters:

> cl <- cutree(hc, k=5)
> table(cl)
cl
 1  2  3  4  5 
23 33 29  4 11
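The two ways of cutting a tree are linked through the merge heights stored in the hclust object. The following sketch uses the built-in USArrests data and complete linkage, both assumptions made for this example, to show that a suitable height reproduces the k = 4 solution.

```r
# Hierarchical clustering of the standardised USArrests data
hc <- hclust(dist(scale(USArrests)), method = "complete")

# Cut the tree into 4 clusters
cl_k <- cutree(hc, k = 4)
table(cl_k)

# hc$height holds the n - 1 merge heights in increasing order.
# A height between the last merge that is kept and the first merge
# that is undone gives the same 4 clusters.
n     <- nrow(USArrests)
h_cut <- mean(hc$height[(n - 4):(n - 3)])
cl_h  <- cutree(hc, h = h_cut)

identical(cl_k, cl_h)
```

Plotting the tree with plot(hc) and adding abline(h = h_cut) shows where the cut falls.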

Available alternatives

See packages class, amap and cluster
See The R bioinformatic page on clustering

References

"The Elements of Statistical Learning"

External links

Clustergram: visualization and diagnostics for cluster analysis

Network Analysis

Wikipedia has related information at Social network

Introduction

We mainly use the following packages to demonstrate network analysis in R: statnet, sna and igraph. This is not a complete list, however. See the gR (graphical models in R) Task View for a more complete list.

Creating simple graphs with igraph

 
> # load the appropriate library
> library(igraph)
> # now create a few simple graphs
> # (recent igraph versions name these make_empty_graph(), make_ring() and make_full_graph())
> # an undirected graph with 10 nodes and no edges
> g1 <- graph.empty(10,directed=FALSE)
> # a directed ring graph with 10 nodes
> g2 <- graph.ring(10,directed=TRUE)
> # a complete undirected graph with 10 nodes
> g3 <- graph.full(10,directed=FALSE)
> # now get information about these graphs
> summary(g1)
IGRAPH U--- 10 0 -- 
> # g1 is an igraph object, U = Undirected, with 10 nodes and 0 edges
> summary(g2)
IGRAPH D--- 10 10 -- Ring graph
> # g2 is an igraph object, D = Directed, with 10 nodes and 10 edges

Creating graphs from data

First load the igraph package

library(igraph)

then you can choose your preferred format. Below are examples of data provided as edge list and as adjacency matrix.

Creating graph from an edge list

An edge list is a two-column matrix in which each row defines one edge. An edge is drawn from each element in the first column to the corresponding element in the second one. Use the graph.edgelist() function (called graph_from_edgelist() in recent igraph versions) to import your data.

 
# producing some random data in edge list form
el <- cbind(sample(1:10, 10), sample(1:10, 10))

# creating and plotting the graph from the edge list
gr <- graph.edgelist(el)
plot(gr)

Creating graph from an adjacency matrix

An adjacency matrix is an n × n matrix describing a graph with n vertices, where each entry a_ij represents the number of edges from vertex i to vertex j. To import your adjacency matrix, use the graph.adjacency() function (called graph_from_adjacency_matrix() in recent igraph versions).

 
# producing a random adjacency matrix
adj <- matrix(sample(0:1, 100, replace=TRUE), 10, 10)

# creating and plotting the graph from the adjacency matrix
gr <- graph.adjacency(adj)
plot(gr)
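The two input formats carry the same information, and converting between them needs only base R. The small directed graph below is hypothetical example data, and the variable names are assumptions made for this sketch; igraph is not required to run it.

```r
# A small directed graph given as an edge list (hypothetical example data)
el <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 1))
n  <- max(el)   # number of vertices

# Edge list -> adjacency matrix: count the edges from vertex i to vertex j
adj <- matrix(0, n, n)
for (k in seq_len(nrow(el))) {
  adj[el[k, 1], el[k, 2]] <- adj[el[k, 1], el[k, 2]] + 1
}
adj

# Adjacency matrix -> edge list: one row per non-zero entry
el_back <- which(adj > 0, arr.ind = TRUE)
el_back <- el_back[order(el_back[, 1], el_back[, 2]), ]
el_back

rowSums(adj)   # out-degree of each vertex
colSums(adj)   # in-degree of each vertex
```

The conversion back to an edge list as written assumes at most one edge per pair of vertices; entries larger than 1 would need to be repeated.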

References

Statnet website includes all the documentation on network analysis using R.
Julien Barnier's introduction (in French)
Journal of Statistical Software #24 Special Issue on Networks in R

Profiling R code

Before starting with parallel or high-performance computing, it is important to analyze and optimize your R code. R provides some useful tools for timing and profiling code, such as system.time() and Rprof(). A good, short introduction is provided in the Writing R Extensions manual.
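As a minimal sketch of these tools, the example below times a loop against a vectorised version with system.time() and then profiles the loop with Rprof(). The two functions and the workload size are hypothetical, and the tryCatch() guard is an assumption added because summarising a profile can fail when a run is too short to collect any samples.

```r
# Two ways to compute a sum of squares (illustrative workload)
sum_sq_loop <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i^2
  total
}
sum_sq_vec <- function(n) sum(as.numeric(seq_len(n))^2)

n <- 1e6
system.time(res_loop <- sum_sq_loop(n))   # timing of the explicit loop
system.time(res_vec  <- sum_sq_vec(n))    # timing of the vectorised version
all.equal(res_loop, res_vec)              # both give the same answer

# Sampling profiler: record where time is spent, then summarise by function
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
for (i in 1:20) sum_sq_loop(n)
Rprof(NULL)                               # stop profiling

# Guarded because the summary may fail if no samples were collected
prof <- tryCatch(summaryRprof(prof_file)$by.self, error = function(e) NULL)
head(prof)
unlink(prof_file)
```

Timings depend on the machine, so the numbers are only meaningful relative to each other.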

Parallel computing with R

There are many packages and tools available for parallel computing with R. A good overview is provided by the CRAN Task View: High-Performance and Parallel Computing with R and several publications:

State of the Art in Parallel Computing with R; Markus Schmidberger, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney, Ulrich Mansmann; Journal of Statistical Software 2009: JSS
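As a minimal sketch of one approach (not taken from the publication above), the parallel package that ships with R provides mclapply(), a parallel version of lapply() based on forking. The function slow_square() and the inputs are hypothetical; forking is not available on Windows, so the example falls back to a single core there.

```r
library(parallel)   # shipped with R

# Stand-in for an expensive computation (hypothetical)
slow_square <- function(x) {
  Sys.sleep(0.05)
  x^2
}
inputs <- 1:8

# Forking is not available on Windows, so use one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

res_seq <- lapply(inputs, slow_square)                        # sequential
res_par <- mclapply(inputs, slow_square, mc.cores = n_cores)  # parallel

identical(res_seq, res_par)   # same results, in the same order
```

On Windows, or when workers should run on several machines, makeCluster() together with parLapply() from the same package is the usual alternative.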

Sources

For the following resources, the authors have explicitly given permission to include their material in the R Programming wikibook. Remember that even though they have given their permission, they should still be correctly cited.

Blogs

R-statistics (the R category). A linked post provides the licence approving this content for use.
GETTING GENETICS DONE - R tag. The R content is available from here: http://gettinggeneticsdone.blogspot.com/search/label/R. The R code is copyrighted under The open source BSD license (as is described here: http://gettinggeneticsdone.blogspot.com/p/copyright.html). The content itself is licensed under a Creative Commons Attribution-Share-Alike 3.0 Unported License (as is shown at the bottom of every post). Bottom line - the R code and written content can be used freely (with attribution).
Struggling Through Problems: http://strugglingthroughproblems.blogspot.com/search/label/R
Backsidesmack R-stuff category. Copyright information is in the footer and explicit permission is in this post
Al3xandr3: http://al3xandr3.github.com/tags/r.html
Cloudnumbers.com (the R category): posts about high-performance computing and cloud computing with R. A linked post provides the license approving this content for use.
The R Tutorial Series (http://rtutorialseries.blogspot.com) by John M. Quick provides a collection of user-friendly guides for researchers, students, and others who want to learn how to use R for their statistical analyses. Its content is available for use in the R Programming wikibook under a Creative Commons BY-SA License.
Exploring Indian census data using R and useful scripts to download weather-related data from websites. The content is available for use in the R wikibook under a cc-sa license.
Plain Data Analysis tips at www.danielmarcelino.com. The topics covered in the blog relate to the social sciences, but they are quite varied.
R Tutorial
R Workshop

Handouts

Index

This page provides tables which make it easy to find functions for usual tasks in statistics in R, SAS and Stata. Other software may also be included in the future such as SPSS.

Data management

Function	R	Stata	SAS
Merge	merge()	merge / mmerge	-
Reshape	reshape()	reshape	-
Expand a dataset	expand() (epicalc)	expand	-
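For the R column, a short sketch of merge() and reshape() on hypothetical data may help. The data frames and their column names are assumptions made for this example.

```r
# Hypothetical example data
people <- data.frame(id = 1:3, name = c("Ann", "Bob", "Cid"))
scores <- data.frame(id = c(2, 3, 4), score = c(70, 85, 90))

# Keep only ids present in both data frames
inner <- merge(people, scores, by = "id")
# Keep all ids; missing values are filled with NA
full <- merge(people, scores, by = "id", all = TRUE)
inner
full

# Wide to long: columns x.1 and x.2 become one column x plus a time index
wide <- data.frame(id = 1:3, x.1 = c(5, 6, 7), x.2 = c(8, 9, 10))
long <- reshape(wide, direction = "long",
                varying = c("x.1", "x.2"), idvar = "id")
long
```

reshape() infers the variable name (x) and the time values (1, 2) from the column names because they follow the name.time pattern.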

Descriptive Statistics

Function	R	Stata	SAS
Mean	mean()	mean	proc means
Histogram	hist()	hist	-
Frequency table	table()	ta	proc freq

Regression models

Function	R	Stata	SAS
Least Square	lm()	reg	proc reg
GLM	glm()	glm	proc genmod
Probit models	glm(, family = binomial(link="probit"))	probit	-
Logit models	glm(, family = binomial(link="logit"))	logit	-
Linear fixed effects model	plm( , model = "within") (plm)	xtreg , fe	-
Linear random effects model	plm( , model = "random") (plm)	xtreg , re	-
Linear quantile regression	rq() (quantreg)	qreg	-
Ordinal logistic regression	polr() (MASS)	ologit	-
Linear IV (2sls)	ivreg() (AER)	ivreg	proc syslin (2sls)
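The base R entries of this table can be tried directly on a built-in data set. The sketch below uses mtcars, with the model formulas chosen only for illustration.

```r
# Least squares: fuel consumption as a function of weight
ols <- lm(mpg ~ wt, data = mtcars)

# Binary outcome: am is 0 for automatic and 1 for manual transmission
logit  <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
probit <- glm(am ~ wt, data = mtcars, family = binomial(link = "probit"))

coef(ols)
coef(logit)
coef(probit)

# Predicted probability of a manual transmission for wt = 3 (3000 lb)
p_manual <- predict(logit, newdata = data.frame(wt = 3), type = "response")
p_manual
```

The logit and probit coefficients are on different scales, so they are better compared through predicted probabilities than directly.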

Programming

Function	R	Stata	SAS
Check some condition	stopifnot()	assert	-

[portable-1] Portable R by Andrew Redd http://sourceforge.net/projects/rportable/

[2] "Python Data Analysis Library". pandas.pydata.org/. Retrieved February 14, 2013.

[3] "Getting started with Pandas". blog.kaggle.com. January 17, 2013. Retrieved February 14, 2013.

[4] This section was imported from the Wikiversity project Installation, How to use R course

[5] Updating R from R (on Windows) – using the {installr} package

[portableR-6] Portable R http://sourceforge.net/projects/rportable/

[7] http://jgr.markushelbig.org/JGR.html

[8] http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/

[rstudio-9] rstudio.org

[verzani-10] John Verzani "Getting Started with RStudio An Integrated Development Environment for R", O'Reilly Media, September 2011

[11] Jeffrey Racine, (forthcoming), "RStudio: A Platform Independent IDE for R and Sweave," Journal of Applied Econometrics.

[statet-12] StatET : http://www.walware.de/goto/statet

[13] Rattle : http://rattle.togaware.com/

[14] Graham J Williams. Rattle: A Data Mining GUI for R. The R Journal, 1(2):45-55, December 2009

[15] Tinn stands for Tinn Is Not Notepad http://www.sciviews.org/Tinn-R/

[16] Note that Notepad++ can be installed on a USB stick http://sourceforge.net/projects/notepadpluspe/

[17] NPPtoR is also a portable software http://sourceforge.net/projects/npptor/

[18] ESS : http://ess.r-project.org/

[19] Vincent Goulet Emacs page http://vgoulet.act.ulaval.ca/emacs

[20] http://www.people.fas.harvard.edu/~aeggers/RWinEdt_installation.pdf

[21] If you want to know more about regular expressions, have a look at the Regular expressions section in the Text Processing page.

[22] Introduction to Data Analysis

[23] http://r.789695.n4.nabble.com/Optional-data-argument-for-a-function-tp850247p850247.html

[24] Patrick Burns, The R Inferno, p. 24

[25] Henrik Bengtsson (2009). R.utils: Various programming utilities. R package version 1.1.7. http://CRAN.R-project.org/package=R.utils

[26] Nathaniel Beck "Making regression and related output more helpful to users" The Political Methodologist 2010 http://politics.as.nyu.edu/docs/IO/2576/beck_tpm_edited.pdf

[27] Jeff Gill graph.summary() http://artsci.wustl.edu/~jgill/Models/graph.summary.s

[ucla-28] See the example on the UCLA Statistics webpage : http://www.ats.ucla.edu/stat/r/faq/deltamethod.htm

[29] Kosuke Imai, Gary King and Olivia Lau (2009). Zelig: Everyone's Statistical Software. R package version 3.4-5. http://CRAN.R-project.org/package=Zelig

[30] Friedrich Leisch Creating R Packages : A Tutorial http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf

[31] http://www.r-project.org/conferences/useR-2008/slides/Murdoch.pdf

[32] Hadley Wickham and Jennifer Bryan R Packages : Organize, Test, Document and Share your code https://r-pkgs.org/

[33] The AER Package http://cran.r-project.org/web/packages/AER/index.html

[34] The Ecdat Package http://cran.r-project.org/web/packages/Ecdat/index.html

[35] "An investigation into the causes of changes in pauperism in England, chiefly during the last two intercensal decades (Part I.)" - GU Yule - Journal of the Royal Statistical Society, June 1899, p 283

[36] Reshaping Data with the reshape Package : http://www.jstatsoft.org/v21/i12

[37] vignette for the tables package: http://cran.r-project.org/web/packages/tables/vignettes/tables.pdf

[38] Merging data frames while preserving the rows

[RDataManual-39] R Data Manual http://cran.r-project.org/doc/manuals/R-data.html

[IdT-40] Paul Murrell introduction to Data Technologies http://www.stat.auckland.ac.nz/~paul/ItDT/

[41] Stat Transfer

[42] speedR

[43] "Opening Large CSV Files in R". Retrieved March 7, 2013.

[44] David Smith. "Importing public data with SAS instructions into R". Revolution Analytics. Retrieved February 1, 2013.

[45] sas7bdat

[gnumeric-46] This command has been tested using Ubuntu 10.10 and R 2.11.1

[47] http://cran.r-project.org/web/packages/rjson/index.html

[hdf5-48] http://cran.r-project.org/web/packages/hdf5/index.html

[49] Brandon Whitcher, Volker J. Schmid, Andrew Thorton "Working with the {DICOM} and {NIfTI} Data Standards in R", Journal of Statistical Software Vol. 44, Issue 6, Oct 2011, link

[RDataManual-50] R Data Manual

[IdT-51] Paul Murrell introduction to Data Technologies

[stringr-52] Hadley Wickham "stringr: modern, consistent string processing" The R Journal, December 2010, Vol 2/2, http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf

[53] http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

[54] In former versions (< 2.10) R also had basic regular expressions: extended regular expressions were used by extended = TRUE (the default), and basic regular expressions by extended = FALSE (obsolete in R 2.10). Since basic regular expressions (‘extended = FALSE’) are now obsolete, the extended option is obsolete in version 2.11.

[55] Extended regular expressions, used by extended = TRUE (the default).

[56] Basic regular expressions, as used by extended = FALSE (obsolete in R 2.10).

Introduction

What is R ?

Why use R?

Alternatives to R

See also

R programming style

How you can help

See Also

References

Sample Session

Manage your workspace

Basic functions

Informations about the session

Memory usage

References

External links

Settings

Installation

Linux

Mac OS

Windows

Download

Setup

Update

Portable R for Windows

Working environment

Terminal

R Gui

Graphical User Interface

Poor Man's GUI (pmg)

Jaguar : Java GUI for R

R commander

Integrated development environment

RStudio

RKward

Eclipse with StatET

Rattle GUI

Tinn R

Notepad++ and NpptoR

Vi, Vim and GVim

Emacs and ESS

WinEdt

gedit with gedit-r-plugin

Customizing R

R profile

Options

References

Documentation

Obtaining Help

Handouts

Teaching Resources

Blogs

Journals

Books

useR and other R conferences

Search Engine

Q&A / Forums

References

Control Structures

Conditional execution

Sets

Loops

Implicit loops

Explicit loops

Iterators

References

Working with functions

Looking at the code of a function

Creating your own function

A simple function without argument which doesn't return anything

Returning an object

Adding arguments

Recursive functions

Functions as Objects

Higher-order functions

apply

tapply

Reduce

References

Debugging

Tracing errors with `traceback()`