R Programming/Data Management
In this section, we deal with methods to read, manage and clean the data. Data can be stored in a large variety of R objects (vectors, lists, dataframes, etc). However, it is usual to store them in a dataframe and this is the reason why we focus here on dataframes.
[edit] Reading and saving data
If data are already in an R format (.Rda or .Rdata), you can load them in memory using load(). You can save data to the R format using save().
load("mydata.Rda") save(mydata,file="mydata.Rda")
[edit] Example Datasets
- Most packages include example datasets
- The
data()function without argument gives the list of all example datasets in all the loaded packages. - If you want to load them in memory, you just need to use the data function and include the name of the dataset as an argument.
str_data()(sfsmisc) gives the structure of all datasets in a package.
> data() # lists all the datasets in all the packages in memory > data(package="datasets") # lists all the datasets in the "datasets" package > data(Orange) # loads the orange dataset in memory > ?Orange # Help for the "Orange" Datasets > str_data("datasets") # gives the structure of all the datasets in the datasets package.
- Some packages includes lots of datasets.
[edit] Building your own data frames
You can create a dataframe using vectors.
u <- rnorm(N) x1 <- rnorm(N) x2 <- rnorm(N) y <- 1+ x1 + x2 + u mydat <- data.frame(y,x1,x2)
R has a spreadsheet-style data editor. One can use it to enter the data into a spreadsheet.
mydat <- edit(data.frame())
Read table from the clipboard :
> mydat <- read.table("clipboard")
You can also read space delimited tables in your code using gsource() (Zelig). Here is an example with Yule 1899 data[3].
mydat <- gsource(var.names = "id union pauperism out old pop", variables = " 1 Kensington 27 5 104 136 2 Paddington 47 12 115 111 3 Fulham 31 21 85 174 ")
[edit] Importing and exporting data
R can read almost all files format.
You may have a look at speedR, a graphical user interface which helps at importing data from Excel, OpenOfficeCalc, CSV and other text files[4].
library(speedR) speedR()
If none of the following methods work, you can use Stat Transfer[5]. Most statistical software can export data in a CSV format and all of them can read CSV data. This is often the best solution to make data available to everyone.
[edit] CSV (csv,txt,dat)
You can import data from a text file (often CSV) using read.table(), read.csv() or read.csv2(). csv.get() (Hmisc) is another possibility.
mydata <- read.table("data.txt",header=TRUE) mydata <- read.table("data.csv", header = TRUE, sep=",") # import from a CSV mydata <- read.csv("data.csv", header=T) mydata <- read.table("data.csv", header = TRUE, sep=";") mydata <- read.csv2("data.csv", header=T)
You can export data to a text file using write.table().
write.table(mydat,file="mydat.csv",quote=T,append=F,sep=",",eol = "\n", na = "NA", dec = ".", row.names = T,col.names = T)
[edit] Fixed width text files
read.fwf() and write.fwf().
[edit] Unstructured text files
- See
scan()andreadLines()in the Reading and writing text files section.
[edit] Stata (dta)
- We can read Stata data using
read.dta()in the foreign package and export to Stata data format usingwrite.dta(). - Note that string variables in Stata are limited to 244 characters. This can be an issue during the exportation process.
- See also
Stata.file()in the memisc package andstata.getin the Hmisc package.
> library("foreign") > mydata <- read.dta("mydata.dta",convert.dates = TRUE, convert.factors = TRUE, convert.underscore = TRUE) > names(mydata) > write.dta(mydata, file = "mydata.dta")
[edit] SAS (sas7bdat)
Experimental support for SAS databases having the sas7bdat extension is provided by the sas7bdat[6] package. However, sas7bdat files generated by 64 bit versions of SAS, and SAS running on non-Microsoft Windows platforms are not yet supported.
[edit] SAS (xpt)
- See also
sasexport.get()andsas.get()in the Hmisc - See also the SASxport package.
library("foreign") mydata<-read.xport("SASData.xpt") names(mydata)
[edit] SPSS (sav)
read.spss()(foreign) andspss.get()(Hmisc)
> library("foreign") > mydata<-read.spss("SPSSData.sav") > names(mydata)
[edit] EViews
readEViews() in the hexView package for EViews files.
[edit] Excel (xls,xlsx)
Importing data from Excel is not easy. The solution depends on your operating system. The RODBC solution below works fine on Windows.
The RODBC solution :
library("RODBC") channel <- odbcConnectExcel("Graphiques pourcent croissance.xls") # creates a connection sqlTables(channel) # List all the tables effec <- sqlFetch(channel, "effec") # Read one spreadsheet as an R table odbcClose(channel) # close the connection (don't forget)
The xlsReadWrite package.
> library(xlsReadWrite) mydat <- read.xls("myfile.xls", colNames = T, sheet = "mysheet", type = "data.frame", from = 1, checkNames = TRUE)
- "sheet" specifies the name or the number of the sheet you want to import.
- "from" specifies the first row of the spreadsheet.
The gnumeric package[7]. This package use an external software called ssconvert which is usually installed with gnumeric, the Gnome office spreadsheet. The read.gnumeric.sheet() function reads xls and xlsx files.
library("gnumeric") df1 <- read.gnumeric.sheet(file = "df.xls", head = TRUE, sheet.name = "Feuille1") df2 <- read.gnumeric.sheet(file = "df.xlsx", head = TRUE, sheet.name = "Feuille1")
See also xlsx for Excel 2007 documents and read.xls() (gdata).
The XLConnect package.
require("XLConnect") wb <- loadWorkbook("myfile.xls", create = FALSE) # Show a summary of the workbook (shows worksheets, # defined names, hidden sheets, active sheet name, ...) summary(wb) # Read data from a worksheet interpreting the first row as column names df1 <- readWorksheet(wb, sheet = "mysheet") # Read data from a named region/range interpreting the first row as column # names df2 <- readNamedRegion(wb, name = "myname", header = TRUE)
XLConnect supports reading and writing both xls and xlsx file formats. Since it is based on Apache POI it only requires a Java installation and as such works on many platforms including Windows, UNIX/Linux and Mac. Besides reading & writing data it provides a number of additional features such as adding plots, cell styling & style actions and many more.
If none of these methods work, we can always export each excel spreadsheets to CSV format and read the CSV in R.
[edit] Google Spread Sheets
You should make the spreadsheet public, publish it as a CSV file. Then you can read it in R using read.csv(). See more on the Revolution's computing blog (link). See also RGoogleDocs (link).
# Read from a Google SpreadSheet. require(RCurl) myCsv <- getURL("https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AkuuKBh0jM2TdGppUFFxcEdoUklCQlJhM2kweGpoUUE&single=true&gid=0&output=csv") read.csv(textConnection(myCsv))
[edit] gnumeric spreadsheets
The gnumeric package[7]. read.gnumeric.sheet() reads one sheet and read.gnumeric.sheets() reads all sheets and store them in a list.
library("gnumeric") df <- read.gnumeric.sheet(file = "df.gnumeric", head = TRUE, sheet.name = "df.csv") View(df) df <- read.gnumeric.sheets(file = "df.gnumeric", head = TRUE) View(df$df.csv)
[edit] OpenOffice and LibreOffice (ods)
The gnumeric package does a good job for ODS files[7].
library("gnumeric") df <- read.gnumeric.sheet(file = "df.ods", head = TRUE, sheet.name = "Feuille1")
speedR is another alternative.
library("speedR") df <- speedR.importany(file = "df.ods")
Note that you can also use the speedR graphical user interface (speedR()) which will return the command line for replication.
library("speedR") speedR()
Or you may use ROpenOffice[8].
library("ROpenOffice") df <- read.ods(file = "df.ods")
[edit] dBase (dbf)
read.dbf() in the foreign package.
library("foreign") df <- read.dbf("salaries07.dbf") str(df)
[edit] Hierarchical Data Format (hdf5)
hdf5 data can be read using the hdf5 package[9].
|
|
This section is a stub. You can help Wikibooks by expanding it. |
[edit] SQL
|
|
This section is a stub. You can help Wikibooks by expanding it. |
[edit] DICOM and NIfTI
- See "Working with the {DICOM} and {NIfTI} Data Standards in R" in the Journal of Statistical Software[10]
[edit] Working with data frames
[edit] Browsing data
- You can browse your data in a spreadsheet using
View(). Depending on your operating system, this option is not always available and the result is not always the same. - You can print the first lines using
head()and the last lines usingtail().
View(mydata) head(mydata, n = 20) # n = 20 means that the first 20 lines are printed in the R console
- RStudio has a nice data browser (
View(mydata)). - RKward has also a nice data browser
- Paul Murrell is currently developping the rdataviewer package (pdf).
[edit] Attaching data
One of the big advantage of R over Stata is that you can deal with multiple datasets in the same time. You just need to specify the name of the dataset and a "$" symbol before each variable name ( for instance mydat1$var1 and mydat2$var1). If you only work with one dataset and you don't want to write again and again the name of the dataset as a prefix for each variable, you can use attach().
mydata$var1 attach(mydata) var1 detach(mydata)
[edit] Duplicates
duplicated()looks at duplicated elements and return a logical vector. You can usetable()to summarize this vector.Duplicated()(sfsmisc) generalizes this command.Duplicated()only marks unique values with "NA".remove.dup.rows()(cwhmisc).unique()keep only the unique lines in a dataset.
library("Zelig") mydat <- gsource( variables = " 1 1 1 1 1 1 1 1 1 2 3 4 1 2 3 4 1 2 2 2 1 2 3 2") unique(mydat) # keep unique rows library(cwhmisc) remove.dup.rows(mydat) # similar to unique() table(duplicated(mydat)) # table duplicated lines mydat$dups <- duplicated(mydat) # add a logical variable for duplicates
[edit] New variables
To create a new variable
mydata$newvar <- oldvar
[edit] How can I rename variables ?
- It is possible to rename variable by redefining the vector of names of a data frame.
- There is also a
rename()function in the reshape package.
df <- data.frame(x = 1:10, y = 21:30) names(df) names(df) <- c("toto","tata") names(df) names(df)[2] <- "titi" names(df)
[edit] Subsetting data
One can subset the data using subset(). The first argument is the name of the dataset, the second argument is a logical condition which say which lines will be included in the new dataset and the last argument is the list of variable which will be included in the new dataset.
In the following example, we generate a fake dataset and we use the subset() command to select the lines and columns of interest. We choose the lines such that x1 > 0 and x2 < 0 and we only keep x1 and x2 as variables.
N <- 100 x1 <- rnorm(N) x2 <- 1 + rnorm(N) + x1 x3 <- rnorm(N) + x2 mydat <- data.frame(x1,x2,x3) subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x1,x2)) subset(x = mydat, subset = x1 > 0 & x2 < 0, select = - x3) # the same.
It is also possible to reorder the columns using the select option.
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x1,x2)) subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x2,x1))
[edit] Sorting data
order()
mydat[order(var1,var2),]
Suppose you want to randomize the order in a data set. You just need to generate a vector from a uniform distribution and to sort following that vector.
df[order(runif(nrow(df))),]
[edit] Missing values
is.na()returns a logical vector equal to TRUE if any of the variable in a dataset is missing and to FALSE otherwise.complete.cases()returns a logical vector indicating TRUE if all cases are complete and FALSE otherwise.
> table(complete.cases(df))
[edit] How can I reshape my data ?
This topic is important if you deal with panel data. Panel data can be stored in a wide format with one observation per unit and a variable for each time period or in a long format with one observation per unit and time period. reshape() reshapes a dataset in a wide or long format.
> country <- c("Angola","UK","France") > gdp.1960 <- c(1,2,3) > gdp.1970 <- c(2,4,6) > mydat <- data.frame(country,gdp1960,gdp1970) > mydat # wide format country gdp1960 gdp1970 1 Angola 1 2 2 UK 2 4 3 France 3 6 > reshape( data = mydat, varying = list(2:3) , v.names = "gdp", direction = "long") # long format country time gdp id 1.1 Angola 1 1 1 2.1 UK 1 2 2 3.1 France 1 3 3 1.2 Angola 2 2 1 2.2 UK 2 4 2 3.2 France 2 6 3
varyinggives the numbers of the columns which are time-varyingv.namesgives the prefix of the time-varying variablesdirectiongives the direction, either "long" or "wide".
- See also :
[edit] External links
- Aggregation and Restructuring of data.frame objects (a sample chapter from the “R in Action” book)
- Printing nested tables in R – bridging between the {reshape} and {tables} packages
[edit] Expanding a dataset
Sometimes we need to duplicate some lines in a dataset. For instance, if we want to generate a fake dataset with a panel data structure. In that case, we would first generate time invariant variables and then duplicate each line by a given scalar in order to create time-varying variables.
It is possible to use the expand() function in the epicalc package. This will multiply each line by a given number.
N <- 1000 T <- 5 wide <- data.frame(id = 1:N,f = rnorm(N), rep = T) library("epicalc") long <- expand(wide,index.var = "rep") long$time <- rep(1:T,N)
We can also use the do it yourself solution or create our own function. The idea is simple. We create a vector which igives for each line the number of times it should be replicated (dups in the following example). Then we use the rep() function to create a vector which repeats the line numbers according to what we want. The last step creates a new dataset which repeats lines according to the desired pattern.
expand <- function(df,dups){ df$dups <- dups pattern <- rep(1:nrow(df), times=df$dups) df2 <- df[pattern,] index <- function(x){ 1:length(x) } df2$year <- unlist(tapply(df2$dups, df2$id, index)) df2$dups <- NULL return(df2) } df <- data.frame(x = rnorm(3), id = 1:3) dups = c(3,1,2) expand(df,dups)
[edit] Merging dataframes ?
Merging data can be very confusing, especially if the case of multiple merge. Here is a simple example :
We have one table describing authors :
> authors <- data.frame( + surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")), + nationality = c("US", "Australia", "US", "UK", "Australia"), + deceased = c("yes", rep("no", 4))) > authors surname nationality deceased 1 Tukey US yes 2 Venables Australia no 3 Tierney US no 4 Ripley UK no 5 McNeil Australia no
and one table describing books
> books <- data.frame( + name = I(c("Tukey", "Venables", "Tierney", + "Ripley", "Ripley", "McNeil", "R Core")), + title = c("Exploratory Data Analysis", + "Modern Applied Statistics ...", + "LISP-STAT", + "Spatial Statistics", "Stochastic Simulation", + "Interactive Data Analysis", + "An Introduction to R"), + other.author = c(NA, "Ripley", NA, NA, NA, NA, + "Venables & Smith")) > books name title other.author 1 Tukey Exploratory Data Analysis <NA> 2 Venables Modern Applied Statistics ... Ripley 3 Tierney LISP-STAT <NA> 4 Ripley Spatial Statistics <NA> 5 Ripley Stochastic Simulation <NA> 6 McNeil Interactive Data Analysis <NA> 7 R Core An Introduction to R Venables & Smith
We want to merge tables books and authors by author's name ("surname" in the first dataset and "name" in the second one). We use the merge() command. We specify the name of the first and the second datasets, then by.x and by.y specify the identifier in both datasets. all.x and all.y specify if we want to keep all the observation of the first and the second dataset. In that case we want to have all the observations from the books dataset but we just keep the observations from the author dataset which match with an observation in the books dataset.
> final <- merge(books, authors, by.x = "name", by.y = "surname", sort=F,all.x=T,all.y=F) > final name title other.author nationality deceased 1 Tukey Exploratory Data Analysis <NA> US yes 2 Venables Modern Applied Statistics ... Ripley Australia no 3 Tierney LISP-STAT <NA> US no 4 Ripley Spatial Statistics <NA> UK no 5 Ripley Stochastic Simulation <NA> UK no 6 McNeil Interactive Data Analysis <NA> Australia no 7 R Core An Introduction to R Venables & Smith <NA> <NA>
It is also possible to merge two data.frame objects while preserving the rows’ order by one of the two merged objects.
[edit] Resources
- R Data Manual[13].
- Paul Murrell's Introduction to Data Technologies[14].
[edit] References
- ↑ The AER Package http://cran.r-project.org/web/packages/AER/index.html
- ↑ The EcDat Package http://cran.r-project.org/web/packages/Ecdat/index.html
- ↑ "An investigation into the causes of changes in pauperism in England, chiefly during the last two intercensal decades (Part I.)" - GU Yule - Journal of the Royal Statistical Society, June 1899, p 283
- ↑ speedR http://speedr.r-forge.r-project.org/
- ↑ Stat Transfer http://www.stattransfer.com/
- ↑ sas7bdat http://cran.r-project.org/web/packages/sas7bdat/index.html
- ↑ a b c This command has been tested using Ubuntu 10.10 and R 2.11.1
- ↑ The Omega Project for Statistical Computing
- ↑ http://cran.r-project.org/web/packages/hdf5/index.html
- ↑ Brandon Whitcher, Volker J. Schmid, Andrew Thorton "Working with the {DICOM} and {NIfTI} Data Standards in R", Journal of Statistical Software Vol. 44, Issue 6, Oct 2011, link
- ↑ Reshaping Data with the reshape Package : http://www.jstatsoft.org/v21/i12
- ↑ vignette for the tables package: http://cran.r-project.org/web/packages/tables/vignettes/tables.pdf
- ↑ R Data Manual http://cran.r-project.org/doc/manuals/R-data.html
- ↑ Paul Murrell introduction to Data Technologies http://www.stat.auckland.ac.nz/~paul/ItDT/