Statistics for Sociology/Data

From Wikibooks, open books for an open world

How to Load Data into the R Workspace

The command to load data into the R workspace depends upon the format of the data. R can load data from many sources, including, for instance, CSV, SPSS, DAT, and XLS files.

For instance, to load a dataset from the CSV format, the following command will work:

read.csv("/home/user/data.csv")

Most datasets include the names of the variables as the first row or "header" row. read.csv() treats the first row as a header by default, but you can state this explicitly with the header argument:

read.csv("/home/user/data.csv", header=TRUE)

Both of the above commands read the dataset and print it to the console, but they do not keep it in the R workspace. To work with the data, assign the result to a variable. This is done with the following command:

data <- read.csv("/home/user/data.csv", header=TRUE)
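The loading step can be tried without an existing file. The sketch below is only an illustration: it writes a small made-up dataset to a temporary CSV file, reads it back, and checks the result. The variable name survey and the column names are assumptions, not part of the original example.

```r
# Write a small made-up dataset to a temporary CSV file
# so the example does not depend on any file on disk.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,gender", "1,23,f", "2,35,m", "3,41,f"), csv_path)

# Read the file and keep the result in a variable.
survey <- read.csv(csv_path, header = TRUE)

# Confirm the data arrived as expected.
dim(survey)    # number of rows and columns
head(survey)   # first rows of the dataset
```

Checking dim() and head() right after loading is a quick way to catch a wrong separator or a missing header row.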

How to Save a Dataset

After working with and potentially modifying a dataset (DATASET), save it so that the changes are not lost when the R session ends. This can be done with the following command:

save(DATASET, file = "dataset.RData")
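A saved .RData file is restored with load(), which recreates the object under its original name. The sketch below shows the round trip with a made-up dataset and a temporary folder (both are assumptions for illustration); it also writes a CSV copy with write.csv(), which can be convenient when the data must be opened outside R.

```r
# Made-up dataset standing in for your own DATASET.
DATASET <- data.frame(id = 1:3, score = c(4, 2, 5))

# Save to a temporary folder so the example leaves no files behind.
rdata_path <- file.path(tempdir(), "dataset.RData")
save(DATASET, file = rdata_path)

# Remove the object, then restore it from the file.
rm(DATASET)
load(rdata_path)   # DATASET exists again under the same name

# Optional: a plain-text copy that other programs can read.
csv_path <- file.path(tempdir(), "dataset.csv")
write.csv(DATASET, csv_path, row.names = FALSE)
```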

Exploring Data in R

R provides several functions for inspecting the structure and contents of your data.

To see the general structure of your dataset (the names of the variables and their types), you can use:
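One function that fits this description is str(). The sketch below assumes the dataset is a data frame called DATA; the contents are made up for illustration.

```r
# Hypothetical example data; DATA stands in for your own dataset.
DATA <- data.frame(age = c(23, 35, 41),
                   gender = c("f", "m", "f"),
                   stringsAsFactors = FALSE)

# Print each variable's name, its type, and its first few values.
str(DATA)
```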


To see the column numbers of the variables in your dataset, you can use:
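One possible approach, sketched here with a made-up data frame, is to pair each variable name with its position. The name col_index is a hypothetical helper variable, not a built-in.

```r
# Hypothetical example data; DATA stands in for your own dataset.
DATA <- data.frame(age = c(23, 35, 41),
                   income = c(30, 52, 47),
                   gender = c("f", "m", "f"))

# Pair each variable name with its column number.
col_index <- setNames(seq_along(DATA), names(DATA))
col_index

# Look up the column number of a single variable.
which(names(DATA) == "income")
```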


To see a list of all the variables in your dataset (DATA):
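A common way to do this is names(), which returns the variable names in column order. The data frame below is made up for illustration.

```r
# Hypothetical example data; DATA stands in for your own dataset.
DATA <- data.frame(age = c(23, 35, 41),
                   income = c(30, 52, 47),
                   gender = c("f", "m", "f"))

# All variable names, in the order the columns appear.
variable_names <- names(DATA)
variable_names

# The same names sorted alphabetically, which helps in wide datasets.
sort(variable_names)
```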


Cleaning Data in R

A common task after data collection is cleaning the data: checking that responses are coherent and removing incomplete responses. One approach is to select only the sufficiently complete cases and store them in a new data frame. In the example below, the survey included a variable (Progress) that recorded the percentage of the survey each participant completed. The example keeps the cases in which participants completed at least 70% of the questions and creates a new dataset (cleaned_data) without the cases that fall below 70%.

cleaned_data <- raw_data[raw_data$Progress >= 70, ]
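If Progress itself contains missing values, a plain logical filter returns rows filled with NA for those cases. The sketch below, using made-up data, wraps the condition in which() so that such rows are dropped instead; treating a missing Progress value as incomplete is an assumption of this example.

```r
# Made-up survey data; the fourth respondent has no Progress value.
raw_data <- data.frame(id = 1:5,
                       Progress = c(100, 65, 70, NA, 88))

# which() returns only the positions where the condition is TRUE,
# so rows with a missing Progress value are dropped as well.
cleaned_data <- raw_data[which(raw_data$Progress >= 70), ]
cleaned_data

# Number of cases removed by the filter.
nrow(raw_data) - nrow(cleaned_data)
```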

Creating New Variables in R

Let's assume you have a dataset (DATA) and want to calculate a new variable (VAR_NEW) from existing variables (VAR1, VAR2, etc.) and add it to the dataset. Here VAR_NEW is the mean of three other variables. Because those variables are columns of DATA, each one must be referenced with the DATA$ prefix:

DATA$VAR_NEW <- (DATA$VAR1 + DATA$VAR2 + DATA$VAR3) / 3
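An alternative worth knowing is rowMeans(), which computes the same mean and can skip missing values. The sketch below uses made-up data; whether a mean based on fewer than three items is acceptable is a substantive decision, so na.rm = TRUE is shown here only as an option.

```r
# Made-up data; the third case is missing VAR2.
DATA <- data.frame(VAR1 = c(1, 4, 3),
                   VAR2 = c(2, 5, NA),
                   VAR3 = c(3, 3, 5))

# Mean of the three variables for each row, ignoring missing values.
DATA$VAR_NEW <- rowMeans(DATA[, c("VAR1", "VAR2", "VAR3")], na.rm = TRUE)
DATA
```

With the arithmetic version, a single missing item makes the result NA for that case; with na.rm = TRUE the mean is taken over the items that are present.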

Converting Variable Types in R

R restricts the functions you can apply to a variable based on its type. Mathematical operations cannot be done on variables that are not numeric (e.g., "character" or "factor" type variables). At times, it is necessary to convert a variable (VAR) that is the wrong type into the correct type (e.g., from "character" to "numeric"). The inner as.character() call matters when VAR is a factor: calling as.numeric() directly on a factor returns the internal level codes rather than the original values. The conversion can be done with the following command:

DATA$VAR <- as.numeric(as.character(DATA$VAR))
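The difference between the two conversions is easiest to see on a small factor. The data below are made up for illustration, and codes is a hypothetical helper variable.

```r
# Made-up variable that was read in as a factor instead of a number.
DATA <- data.frame(VAR = factor(c("10", "20", "20", "35")))

# Converting the factor directly gives the internal level codes,
# not the values the respondent gave.
codes <- as.numeric(DATA$VAR)
codes

# Converting through character first recovers the actual values.
DATA$VAR <- as.numeric(as.character(DATA$VAR))
class(DATA$VAR)
mean(DATA$VAR)
```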

Recoding Variables in R

It's not uncommon in the social sciences to need to recode a variable (VAR). For instance, you might have a scale with 10 different questions, half of which are reverse coded (i.e., they are coded in the opposite direction from the other items, so a 1 should become a 5 and a 5 should become a 1). In such a situation, before combining the items in the scale, it's necessary to reverse code those items. Here's a way to do that in R.

First, install the package "car" (this only needs to be done once):

install.packages("car")

Next, load the "car" library:

library(car)

Then, recode the variable:

VARX <- recode(VAR, '1=5; 2=4; 3=3; 4=2; 5=1')
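For a numeric item, the same reversal can also be done with plain arithmetic and no extra package. This is a different technique from recode(): the sketch assumes a numeric variable on a 1 to 5 scale, and scale_min and scale_max are hypothetical helper names.

```r
# Made-up responses on a 1-5 scale.
VAR <- c(1, 2, 3, 4, 5, 2)

# Endpoints of the response scale (assumed here to be 1 and 5).
scale_min <- 1
scale_max <- 5

# Subtracting from (max + min) flips the scale: 1 <-> 5, 2 <-> 4, 3 stays 3.
VARX <- (scale_max + scale_min) - VAR
VARX
```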

Viewing Variable Labels in R (from SPSS)

SPSS datasets include labels for variable categories (e.g., for the variable "race," 1 might equal "black"). To view the value labels for the variable "VAR" after opening an SPSS dataset (DATA) with the "haven" library, you can use the following command:
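One function that matches this description is print_labels() from haven. The sketch below builds a small labelled vector by hand as a stand-in for DATA$VAR (the values and labels are made up) and only runs if haven is installed.

```r
# Run only when the haven package is available.
if (requireNamespace("haven", quietly = TRUE)) {
  # Stand-in for DATA$VAR; in practice the variable would come from
  # a dataset opened with haven::read_sav().
  VAR <- haven::labelled(c(1, 2, 1),
                         labels = c(black = 1, white = 2))

  # Print each code alongside its label.
  haven::print_labels(VAR)
}
```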


You should then see a list of the codes for each label followed by the label itself.

Two additional commands can be helpful. First, to get the variable label (so, the description of the variable):
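haven stores the variable label in an attribute named "label", so one way to read it is attr(). The sketch below mimics such a variable with base R's structure() so that it runs without the package; the values and label text are made up.

```r
# Stand-in for DATA$VAR carrying the attributes haven attaches on import.
VAR <- structure(c(1, 2, 1),
                 label = "Race of respondent",
                 labels = c(black = 1, white = 2))

# The variable label, i.e. the description of the variable.
variable_label <- attr(VAR, "label")
variable_label
```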


To get the value labels (in a different format from above), you can use the following command:
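haven stores the value labels in an attribute named "labels", a named vector whose names are the label texts and whose values are the codes. One way to read it is attr(); the sketch below mimics such a variable with base R's structure() so that it runs without the package, using made-up values.

```r
# Stand-in for DATA$VAR carrying the attributes haven attaches on import.
VAR <- structure(c(1, 2, 1),
                 label = "Race of respondent",
                 labels = c(black = 1, white = 2))

# Named vector of value labels: names are labels, values are codes.
value_labels <- attr(VAR, "labels")
value_labels

# Look up the label that belongs to a particular code.
names(value_labels)[value_labels == 2]
```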