R Programming/Data
From Wikibooks, the open-content textbooks collection
Example Datasets
- Most packages include example datasets.
- Called without arguments, the data() function lists the example datasets in all the loaded packages.
- To load a dataset into memory, call data() with the name of the dataset as an argument.
> data() # lists the datasets in all the loaded packages
> data(package = "datasets") # lists the datasets in the "datasets" package
> data(Orange) # loads the Orange dataset into memory
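The listing returned by data() can also be inspected programmatically. The following is a minimal sketch, assuming only base R and the bundled "datasets" package; the variable names listing and items are illustrative.

```r
# data() returns an object whose $results matrix has one row per dataset.
listing <- data(package = "datasets")
items <- listing$results[, "Item"]
head(items)

# Load one dataset by name and check its size.
data(Orange)
dim(Orange)   # 35 rows, 3 columns
```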
Data Input
- scan()
- readLines()
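As a rough illustration of the difference between the two functions listed above, the sketch below writes a small temporary file (the file contents are made up for the example) and reads it back both ways.

```r
# Create a small temporary text file to read from.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1970 45 63", "1980 52 59"), tmp)

# readLines() returns a character vector with one element per line.
lines <- readLines(tmp)

# scan() returns a single vector of all whitespace-separated values
# (numeric by default).
values <- scan(tmp, quiet = TRUE)

unlink(tmp)  # remove the temporary file
```

readLines() is convenient when each line needs custom parsing, whereas scan() suits files that contain nothing but values of one type.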
R has a spreadsheet-style data editor. edit() returns the modified copy, so assign the result to keep your changes:
mydata <- edit(mydata)
Read a table from the clipboard:
> read.table("clipboard")
               Holidays HalfDays FullDays
Norway            0.333    0.056    0.611
Canada            0.067    0.200    0.733
Greece            0.138    0.862    0.000
France/Germany    0.083    0.083    0.833
Importing/Exporting Data
- Hmisc csv.get()
- Hmisc sasxport.get()
- Hmisc sas.get()
- Hmisc spss.get()
- Hmisc stata.get()
Importing a text file:
mydata <- read.table("http://perso.univ-rennes1.fr/arthur.charpentier/data.txt",
header=TRUE)
mydata <- read.table("tmp.txt", header = TRUE, sep=",")
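The header and sep arguments can be checked with a small round trip. This is only a sketch: the data frame and its column names (id, score) are invented for the example, and write.table() is used here as one possible way to export a comma-separated file.

```r
# A small data frame to export (hypothetical example data).
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Export as comma-separated text with a header line and no row names.
tmp <- tempfile(fileext = ".txt")
write.table(df, tmp, sep = ",", row.names = FALSE)

# Import it again: header = TRUE reads the first line as column names,
# sep = "," tells read.table() that fields are separated by commas.
back <- read.table(tmp, header = TRUE, sep = ",")
unlink(tmp)
```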
- Given the data file data.txt located at <path>:
1970 45 63
1980 52 59
1990 59 52
2000 63 45
- This data can easily be loaded into R using the following commands:
setwd("<path>") # change working directory
data <- read.table("data.txt") # load data
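Because the file has no header line, read.table() names the columns V1, V2 and V3 by default. The sketch below recreates the file in a temporary directory and supplies column names through col.names; the names year, a and b are placeholders, since the text does not say what the columns represent.

```r
# Recreate the example file in a temporary directory.
path <- file.path(tempdir(), "data.txt")
writeLines(c("1970 45 63", "1980 52 59", "1990 59 52", "2000 63 45"), path)

# Read it, giving the columns explicit (hypothetical) names.
dat <- read.table(path, col.names = c("year", "a", "b"))
str(dat)
unlink(path)
```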
One can easily import data from SPSS, SAS, Stata and other statistical packages:
Stata
library(foreign)
mydata <- read.dta("STATAData.dta")
names(mydata)
SAS
library(foreign)
mydata <- read.xport("SASData.xpt")
names(mydata)
SPSS
library(foreign)
mydata <- read.spss("SPSSData.sav", to.data.frame = TRUE) # returns a list unless to.data.frame = TRUE
names(mydata)
Excel
- The xlsReadWrite package is no longer available.
- The gdata package includes a read.xls() function.
Import from an Excel spreadsheet (this example uses the xlsReadWrite package, which is no longer available):
> library(xlsReadWrite)
> mydata <- read.xls("myfile.xls", colNames = TRUE, sheet = "mysheet",
+ type = "data.frame", from = 1, checkNames = TRUE)
- "sheet" specifies the name or the number of the sheet you want to import.
- "from" specifies the first row of the spreadsheet.
You can also use the "RODBC" package:
library(RODBC)
channel <- odbcConnectExcel("Graphiques pourcent croissance.xls") # creates a connection
sqlTables(channel) # lists all the tables
effec <- sqlFetch(channel, "effec") # reads one sheet as an R data frame
odbcClose(channel) # closes the connection
Google Doc Spreadsheet
Working with data
Load/Attach/Detach data
attach() adds a data frame to the search path, so that you can refer to its variables without specifying the name of the data frame each time. detach() removes it again.
attach(mydata)
…
detach(mydata)
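A short sketch of the attach/detach pattern, using the built-in Orange dataset as a stand-in for mydata. The with() call at the end is not part of the text above; it is shown as one common alternative that avoids changing the search path.

```r
data(Orange)

# While Orange is attached, its columns can be used by bare name.
attach(Orange)
mean_circ_attached <- mean(circumference)
detach(Orange)

# with() evaluates an expression inside the data frame without attaching it.
mean_circ_with <- with(Orange, mean(circumference))
```

Both approaches give the same value; with() has the advantage that nothing needs to be detached afterwards.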
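Describe data
names(mydata) # variable names
str(mydata) # structure: type and first values of each variable
summary(mydata) # summary statistics for each variable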
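To try these functions without a dataset of your own, the sketch below applies them to the built-in Orange dataset; the variable names vars and summ are illustrative.

```r
data(Orange)

vars <- names(Orange)    # "Tree" "age" "circumference"
str(Orange)              # prints the type and first values of each column
summ <- summary(Orange)  # per-column summaries, returned as a table
print(summ)
```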
Dealing with missing values
Creating/removing variables
mydata$newvar <- mydata$oldvar # creates newvar as a copy of oldvar
mydata$newvar <- NULL # removes newvar
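The following sketch shows a variable being derived from an existing one and the original then being dropped. The data frame and its column names (height_cm, height_m) are invented for the example.

```r
# Hypothetical example data.
df <- data.frame(height_cm = c(150, 160, 170))

# Create a new variable from an existing one.
df$height_m <- df$height_cm / 100

# Remove a variable by assigning NULL to it.
df$height_cm <- NULL

names(df)
```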
Exporting data
Merging two dataframes
Merging data can be confusing, especially when several merges are chained. Here is a simple example:
We have one table describing authors:
> authors <- data.frame(
+ surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+ nationality = c("US", "Australia", "US", "UK", "Australia"),
+ deceased = c("yes", rep("no", 4)))
> authors
   surname nationality deceased
1    Tukey          US      yes
2 Venables   Australia       no
3  Tierney          US       no
4   Ripley          UK       no
5   McNeil   Australia       no
and one table describing books:
> books <- data.frame(
+ name = I(c("Tukey", "Venables", "Tierney",
+ "Ripley", "Ripley", "McNeil", "R Core")),
+ title = c("Exploratory Data Analysis",
+ "Modern Applied Statistics ...",
+ "LISP-STAT",
+ "Spatial Statistics", "Stochastic Simulation",
+ "Interactive Data Analysis",
+ "An Introduction to R"),
+ other.author = c(NA, "Ripley", NA, NA, NA, NA,
+ "Venables & Smith"))
> books
      name                         title     other.author
1    Tukey     Exploratory Data Analysis             <NA>
2 Venables Modern Applied Statistics ...           Ripley
3  Tierney                     LISP-STAT             <NA>
4   Ripley            Spatial Statistics             <NA>
5   Ripley         Stochastic Simulation             <NA>
6   McNeil     Interactive Data Analysis             <NA>
7   R Core          An Introduction to R Venables & Smith
We want to merge the books and authors tables by author name ("name" in books, "surname" in authors) using the merge() command. The first two arguments are the two datasets; by.x and by.y give the identifier column in the first and the second dataset respectively. all.x and all.y specify whether to keep all the observations of the first and the second dataset. Here we keep every observation from the books dataset, but only those observations from the authors dataset that match an observation in books.
> final <- merge(books, authors, by.x = "name", by.y = "surname", sort = FALSE, all.x = TRUE, all.y = FALSE)
> final
      name                         title     other.author nationality deceased
1    Tukey     Exploratory Data Analysis             <NA>          US      yes
2 Venables Modern Applied Statistics ...           Ripley   Australia       no
3  Tierney                     LISP-STAT             <NA>          US       no
4   Ripley            Spatial Statistics             <NA>          UK       no
5   Ripley         Stochastic Simulation             <NA>          UK       no
6   McNeil     Interactive Data Analysis             <NA>   Australia       no
7   R Core          An Introduction to R Venables & Smith        <NA>     <NA>
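To see what all.x changes, the sketch below compares the default merge, which keeps only matching rows, with all.x = TRUE on a cut-down version of the two tables. The reduced tables and the names inner, left and unmatched are illustrative; stringsAsFactors = FALSE is set explicitly so the example behaves the same on older versions of R.

```r
authors <- data.frame(
  surname = c("Tukey", "Ripley"),
  nationality = c("US", "UK"),
  stringsAsFactors = FALSE)

books <- data.frame(
  name = c("Tukey", "Ripley", "R Core"),
  title = c("Exploratory Data Analysis", "Spatial Statistics",
            "An Introduction to R"),
  stringsAsFactors = FALSE)

# Default: only books whose author appears in authors are kept.
inner <- merge(books, authors, by.x = "name", by.y = "surname")

# all.x = TRUE: every book is kept; missing author details become NA.
left <- merge(books, authors, by.x = "name", by.y = "surname", all.x = TRUE)

# Books that found no match in authors.
unmatched <- left$name[is.na(left$nationality)]
```

Checking for NA in a column that comes from the second table is a quick way to spot rows that did not match.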