# R Programming/Working with data frames

In this section, we deal with methods to read, manage and clean-up a data frame.

In R, a dataframe is a list of vectors of the same length. They don't have to be of the same type. For instance, you can combine in one dataframe a logical, a character and a numerical vector.

## Reading and saving data[edit| edit source]

If data are already in an R format (`.Rda`

or `.Rdata`

), you can load them in memory using `load()`

.
You can save data to the R format using `save()`

.

```
load("mydata.Rda")
save(list='mydata',file="mydata.Rda")
```

## Example Datasets[edit| edit source]

- Most packages include example datasets to test the functions.
- The
`data()`

function without argument gives the list of all example datasets in all the loaded packages. - If you want to load them in memory, you just need to use the data function and include the name of the dataset as an argument.
`str_data()`

(**sfsmisc**) gives the structure of all datasets in a package.

```
> data() # lists all the datasets in all the packages in memory
> data(package="datasets") # lists all the datasets in the "datasets" package
> data(Orange) # loads the orange dataset in memory
> ?Orange # Help for the "Orange" Datasets
> str_data("datasets") # gives the structure of all the datasets in the datasets package.
```

- Some packages include lots of datasets.
- The
**datasets**package - The
**AER**package^{[1]}includes replication datasets for some important textbooks in econometrics. - The
**EcDat**package^{[2]}includes replication archive for the Journal of Applied Econometrics, the Journal of Business and Economic Statistics, etc.

- The

## Building your own data frames[edit| edit source]

You can create a dataframe using vectors.

```
N <- 100
u <- rnorm(N)
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 1 + x1 + x2 + u
mydat <- data.frame(y,x1,x2)
```

R has a spreadsheet-style data editor. One can use it to enter the data into a spreadsheet.

```
mydat <- edit(data.frame())
```

Read table from the clipboard :

```
> mydat <- read.table("clipboard")
```

You can also read space delimited tables in your code using `gsource()`

(**Zelig**). Here is an example with Yule 1899 data.^{[3]}

```
mydat <- gsource(var.names = "id union pauperism out old pop",
variables = "
1 Kensington 27 5 104 136
2 Paddington 47 12 115 111
3 Fulham 31 21 85 174
")
```

You can change the column names for a dataFrame.

```
c1 <- c('A','B','C')
c2 <- c('Alpha','Bravo','Charlie')
c3 <- c('1','2','3')
mydf <- data.frame(c1,c2,c3)
colnames(mydf) <- c('ColName1','ColName2','ColName3')
```

## Describing a data frame[edit| edit source]

There are various ways to inspect a data frame, such as:

`str(df)`

gives a very brief description of the data`names(df)`

gives the name of each variable`summary(df)`

gives some very basic summary statistics for each variable`head(df)`

shows the first few rows`tail(df)`

shows the last few rows.

## Browsing data[edit| edit source]

- You can browse your data in a spreadsheet using
`View()`

. Depending on your operating system, this option is not always available and the result is not always the same. - You can print the first lines using
`head()`

and the last lines using`tail()`

.

```
View(mydata)
head(mydata, n = 20) # n = 20 means that the first 20 lines are printed in the R console
```

- RStudio has a nice data browser (
`View(mydata)`

). - RKward has also a nice data browser
- Paul Murrell is currently developping the
**rdataviewer**package (pdf).

## Binding row or column[edit| edit source]

Most of the times when you are working with data frames, you are changing the data and one of the several changes you can do to a data frame is adding column or row and as the result increase the dimension of your data frame. There are few different ways to do it but the easiest ones are `cbind()`

and `rbind()`

which are part of the **base** package:

```
mydata <- cbind(mydata, newVector)
mydata <- rbind(mydata, newVector)
```

Remember that the length of the newVector should match the length of the side of the data frame that you are attaching it to. For example, in the `cbind()`

command the following statement should be TRUE:

```
dim(mydata)[1]==length(newVector)
```

To see more samples, you can always do `?base::cbind`

and `?base::rbind`

.

## Attaching data[edit| edit source]

One of the big advantages of R over Stata is that you can deal with multiple datasets at the same time. You just need to specify the name of the dataset and a "$" symbol before each variable name ( for instance `mydat1$var1`

and `mydat2$var1`

). If you only work with one dataset and you don't want to write again and again the name of the dataset as a prefix for each variable, you can use `attach()`

.

```
mydata$var1
attach(mydata)
var1
detach(mydata)
```

## Detecting duplicates[edit| edit source]

When you want to clean up a data set, it is very often useful to check if you don't have the same information twice in the data. R provides some functions to detect duplicates.

`duplicated()`

looks at duplicated elements and returns a logical vector. You can use`table()`

to summarize this vector.`Duplicated()`

(**sfsmisc**) generalizes this command.`Duplicated()`

only marks unique values with "NA".`remove.dup.rows()`

(**cwhmisc**).`unique()`

keeps only the unique lines in a dataset.`distinct()`

(**dplyr**) retains only unique/distinct rows from a dataset.

```
library("Zelig")
mydat <- gsource(
variables = "
1 1 1 1
1 1 1 1
1 2 3 4
1 2 3 4
1 2 2 2
1 2 3 2")
unique(mydat) # keep unique rows
library(cwhmisc)
remove.dup.rows(mydat) # similar to unique()
table(duplicated(mydat)) # table duplicated lines
mydat$dups <- duplicated(mydat) # add a logical variable for duplicates
```

## Creating and removing variables[edit| edit source]

To create a new variable

```
mydata$newvar <- oldvar
```

If you want to delete a variable in a dataset, you can assign NULL to that variable :

```
# Delete the x variable in the df data frame.
df$x <- NULL
```

## Renaming variables[edit| edit source]

- It is possible to rename a variable by redefining the vector of names of a data frame.
- There is also a
`rename()`

function in the**reshape**package.

```
df <- data.frame(x = 1:10, y = 21:30)
names(df)
names(df) <- c("toto","tata")
names(df)
names(df)[2] <- "titi"
names(df)
```

## Creating a subset of the data[edit| edit source]

One can subset the data using `subset()`

. The first argument is the name of the dataset, the second argument is a logical condition which say which lines will be included in the new dataset and the last argument is the list of variable which will be included in the new dataset.

In the following example, we generate a fake dataset and we use the `subset()`

command to select the lines and columns of interest. We choose the lines such that x1 > 0 and x2 < 0 and we only keep x1 and x2 as variables.

```
N <- 100
x1 <- rnorm(N)
x2 <- 1 + rnorm(N) + x1
x3 <- rnorm(N) + x2
mydat <- data.frame(x1,x2,x3)
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x1,x2))
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = - x3) # the same.
```

It is also possible to reorder the columns using the `select`

option.

```
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x1,x2))
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x2,x1))
```

## Sorting and ordering[edit| edit source]

`order()`

```
mydat[order(var1,var2),]
```

Suppose you want to randomize the order in a data set. You just need to generate a vector from a uniform distribution and to sort following that vector.

```
df[order(runif(nrow(df))),]
```

## Detecting missing values[edit| edit source]

`is.na()`

returns a logical vector equal to TRUE if any of the variable in a dataset is missing and to FALSE otherwise.`complete.cases()`

returns a logical vector indicating TRUE if all cases are complete and FALSE otherwise.

```
> table(complete.cases(df))
```

## Reshaping a dataframe[edit| edit source]

This topic is important if you deal with panel data. Panel data can be stored in a wide format with one observation per unit and a variable for each time period or in a long format with one observation per unit and time period. `reshape()`

reshapes a dataset in a wide or long format.

```
> country <- c("'Angola'","'UK'","'France'")
> gdp.1960 <- c(1,2,3)
> gdp.1970 <- c(2,4,6)
> mydat <- data.frame(country,gdp.1960,gdp.1970)
> mydat # wide format
country gdp.1960 gdp.1970
1 Angola 1 2
2 UK 2 4
3 France 3 6
> reshape( data = mydat, varying = list(2:3) , v.names = "gdp", direction = "long") # long format
country time gdp id
1.1 Angola 1 1 1
2.1 UK 1 2 2
3.1 France 1 3 3
1.2 Angola 2 2 1
2.2 UK 2 4 2
3.2 France 2 6 3
```

`varying`

gives the numbers of the columns which are time-varying`v.names`

gives the prefix of the time-varying variables`direction`

gives the direction, either "long" or "wide".

- See also :
`reShape()`

(**Hmisc**)- See Hadley Wickham's
**reshape**package^{[4]} - See Duncan Murdoch's
**tables**package^{[5]}

### External links[edit| edit source]

## Expanding a dataset[edit| edit source]

Sometimes we need to duplicate some lines in a dataset. For instance, if we want to generate a fake dataset with a panel data structure. In that case, we would first generate time invariant variables and then duplicate each line by a given scalar in order to create time-varying variables.

It is possible to use the `expand()`

function in the **epicalc** package (since this package does not exist anymore, an option to expand is given in [1]). This will multiply each line by a given number.

```
N <- 1000
T <- 5
wide <- data.frame(id = 1:N,f = rnorm(N), rep = T)
library("epicalc")
long <- expand(wide,index.var = "rep")
long$time <- rep(1:T,N)
```

We can also use the *do it yourself solution* or create our own function. The idea is simple. We create a vector which igives for each line the number of times it should be replicated (dups in the following example). Then we use the `rep()`

function to create a vector which repeats the line numbers according to what we want. The last step creates a new dataset which repeats lines according to the desired pattern.

```
expand <- function(df,dups){
df$dups <- dups
pattern <- rep(1:nrow(df), times=df$dups)
df2 <- df[pattern,]
index <- function(x){
1:length(x)
}
df2$year <- unlist(tapply(df2$dups, df2$id, index))
df2$dups <- NULL
return(df2)
}
df <- data.frame(x = rnorm(3), id = 1:3)
dups = c(3,1,2)
expand(df,dups)
```

## Merging dataframes[edit| edit source]

Merging data can be very confusing, especially if the case of multiple merge. Here is a simple example :

We have one table describing authors :

```
> authors <- data.frame(
+ surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+ nationality = c("US", "Australia", "US", "UK", "Australia"),
+ deceased = c("yes", rep("no", 4)))
> authors
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia no
```

and one table describing books

```
> books <- data.frame(
+ name = I(c("Tukey", "Venables", "Tierney",
+ "Ripley", "Ripley", "McNeil", "R Core")),
+ title = c("Exploratory Data Analysis",
+ "Modern Applied Statistics ...",
+ "LISP-STAT",
+ "Spatial Statistics", "Stochastic Simulation",
+ "Interactive Data Analysis",
+ "An Introduction to R"),
+ other.author = c(NA, "Ripley", NA, NA, NA, NA,
+ "Venables & Smith"))
> books
name title other.author
1 Tukey Exploratory Data Analysis <NA>
2 Venables Modern Applied Statistics ... Ripley
3 Tierney LISP-STAT <NA>
4 Ripley Spatial Statistics <NA>
5 Ripley Stochastic Simulation <NA>
6 McNeil Interactive Data Analysis <NA>
7 R Core An Introduction to R Venables & Smith
```

We want to merge tables books and authors by author's name ("surname" in the first dataset and "name" in the second one). We use the merge() command. We specify the name of the first and the second datasets, then by.x and by.y specify the identifier in both datasets. all.x and all.y specify if we want to keep all the observation of the first and the second dataset. In that case we want to have all the observations from the books dataset but we just keep the observations from the author dataset which match with an observation in the books dataset.

```
> final <- merge(books, authors, by.x = "name", by.y = "surname", sort=F,all.x=T,all.y=F)
> final
name title other.author nationality deceased
1 Tukey Exploratory Data Analysis <NA> US yes
2 Venables Modern Applied Statistics ... Ripley Australia no
3 Tierney LISP-STAT <NA> US no
4 Ripley Spatial Statistics <NA> UK no
5 Ripley Stochastic Simulation <NA> UK no
6 McNeil Interactive Data Analysis <NA> Australia no
7 R Core An Introduction to R Venables & Smith <NA> <NA>
```

It is also possible to merge two data.frame objects while preserving the rows’ order by one of the two merged objects.^{[6]}

## Resources[edit| edit source]

- R Data Manual.
^{[7]} - Paul Murrell's
*Introduction to Data Technologies*.^{[8]}

## References[edit| edit source]

- ↑ The AER Package http://cran.r-project.org/web/packages/AER/index.html
- ↑ The EcDat Package http://cran.r-project.org/web/packages/Ecdat/index.html
- ↑ "An investigation into the causes of changes in pauperism in England, chiefly during the last two intercensal decades (Part I.)" - GU Yule - Journal of the Royal Statistical Society, June 1899, p 283
- ↑ Reshaping Data with the reshape Package : http://www.jstatsoft.org/v21/i12
- ↑ vignette for the tables package: http://cran.r-project.org/web/packages/tables/vignettes/tables.pdf
- ↑ Merging data frames while preserving the rows
- ↑ R Data Manual http://cran.r-project.org/doc/manuals/R-data.html
- ↑ Paul Murrell
*introduction to Data Technologies*http://www.stat.auckland.ac.nz/~paul/ItDT/