Statistics for Sociology/Data

From Wikibooks, open books for an open world

How to Load Data into the R Workspace

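The command to load data into the R workspace depends on the format of the data. R can load data from many sources, including CSV, SPSS, DAT, and XLS files. Base R reads CSV and other delimited text files directly; formats such as SPSS are read with add-on packages (for example, "haven").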

For instance, to load a dataset from the CSV format, the following command will work:

read.csv("/home/user/data.csv")

Most datasets include the names of the variables as the first row or "header" row. read.csv assumes a header row by default, but you can state this explicitly with the following command (use header=FALSE if the file has no header row):

read.csv("/home/user/data.csv", header=TRUE)

Both of the above commands read the dataset and print it to the console, but they do not keep it in the R workspace. To work with the data, assign the result to a variable, which stores the dataset in the workspace under that name. This is done with the following command:

data <- read.csv("/home/user/data.csv", header=TRUE)
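The loading step can be tried without an existing file. The sketch below is only an illustration: the file contents and the names csv_path and survey are made up for the example, and the file is written to a temporary location first so the code runs on its own.

```r
# Hypothetical example data, written to a temporary file so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,race", "1,34,white", "2,51,black", "3,27,asian"), csv_path)

# header = TRUE is the default for read.csv(); it is spelled out here for clarity
survey <- read.csv(csv_path, header = TRUE)

nrow(survey)    # number of cases (rows) that were read
names(survey)   # variable names taken from the header row
head(survey)    # first rows of the stored dataset
```

Because the result is stored in survey, later commands can refer to the dataset by that name.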

How to Save a Dataset

After working with and potentially modifying a dataset (DATASET), it is important to save the dataset. This can be done with the following command:

save(DATASET, file = "dataset.RData")
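A saved .RData file can be read back with load(), which restores the object under its original name. The following is a minimal sketch with a made-up data frame (scores) and a temporary file path, both assumptions for illustration only.

```r
# Hypothetical dataset and temporary file path, used only for illustration
scores <- data.frame(id = 1:3, score = c(10, 12, 9))
rdata_path <- tempfile(fileext = ".RData")

# Save the object to disk
save(scores, file = rdata_path)

# Remove it from the workspace, then restore it from the file
rm(scores)
loaded_names <- load(rdata_path)  # load() returns the names of the restored objects

loaded_names      # name(s) of the objects that were restored
exists("scores")  # checks that the dataset is back in the workspace
```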

Exploring Data in R

There are a number of functions in R that allow you to view the nature of your data.

To see the general structure of your dataset (the names of the variables and the type), you can use:

str(DATA)

To see the names of the columns in your dataset, listed in column order (so the position of a name in the output is its column number), you can use:

colnames(DATA)

To see a list of all the variables in your dataset (DATA):

names(DATA)

To see details of a specific variable, the "str" command can also be used with a specific variable (VARIABLE):

str(DATA$VARIABLE)

Another quick way to gather information about a variable in your dataset is to use the "summary" command, which, for a numeric variable, will provide the minimum, maximum, mean, median, and quartiles (as well as the number of NAs):

summary(DATA$VARIABLE)
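The exploration commands can be tried on a small example. The data frame below (survey, with variables age and race) is hypothetical and exists only to make the sketch runnable.

```r
# Hypothetical survey data with one missing age
survey <- data.frame(
  age  = c(23, 35, 47, NA, 61),
  race = c("white", "black", "asian", "white", "black"),
  stringsAsFactors = FALSE
)

str(survey)          # variable names, types, and the first few values
names(survey)        # variable names in column order
summary(survey$age)  # minimum, quartiles, median, mean, maximum, and count of NAs
```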

Cleaning Data in R

One of the common tasks after data collection is to clean one's data, making sure that responses are coherent and that incomplete responses are removed. One approach for removing incomplete responses is to use a command that detects the missing responses, deletes them from the dataframe, and creates a new dataframe. In the example below, a variable was included in the survey that measured how much of the survey was completed. The example uses that variable (Progress) to select the cases in which participants completed at least 70% of the questions and creates a new dataset (cleaned_data) by removing the cases with fewer than 70% of the responses completed.

cleaned_data <- subset(raw_data, Progress >= 70)
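To see what this kind of filter keeps and drops, here is a small sketch with invented values for Progress. The data and the id column are assumptions made for illustration. Note that subset() also drops cases where the condition is NA.

```r
# Hypothetical raw data: Progress is the percentage of the survey completed
raw_data <- data.frame(
  id       = 1:5,
  Progress = c(100, 45, 70, NA, 88)
)

# Keep cases that completed at least 70%; rows with missing Progress are dropped too
cleaned_data <- subset(raw_data, Progress >= 70)

cleaned_data$id                       # ids of the cases that were kept
nrow(raw_data) - nrow(cleaned_data)   # number of cases removed
```

The original raw_data is left unchanged, so the removed cases can still be inspected later.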

For those who prefer a graphical user interface (GUI) for modifying or editing data, there is an add-in for RStudio called "editData" that allows this. Install it:

install.packages("editData")

Open a blank script file, type in the name of your dataset you want to edit, then click on the "Addins" drop down menu and select "editData." That will open the data editor window.

Creating New Variables in R

Let's assume you have a dataset (DATA) and want to calculate a new variable (VAR_NEW) based on existing variables (VAR1, VAR2, etc.) and insert it into the dataset. The new variable (VAR_NEW) is going to be the mean of 3 other variables. Here's how you would do that in R:

DATA$VAR_NEW <- (DATA$VAR1 + DATA$VAR2 + DATA$VAR3) / 3
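One thing to watch with this approach is missing data: if any of the three variables is NA for a case, the sum (and so the mean) is NA as well. A possible alternative, sketched here with invented values and a hypothetical second column VAR_MEAN, is rowMeans() with na.rm = TRUE, which averages whatever values are present.

```r
# Hypothetical dataset; the third case is missing VAR2
DATA <- data.frame(
  VAR1 = c(1, 4, 5),
  VAR2 = c(2, 4, NA),
  VAR3 = c(3, 4, 1)
)

# Arithmetic mean: any NA in a row makes the result NA for that row
DATA$VAR_NEW <- (DATA$VAR1 + DATA$VAR2 + DATA$VAR3) / 3

# rowMeans() with na.rm = TRUE averages only the non-missing values in each row
DATA$VAR_MEAN <- rowMeans(DATA[, c("VAR1", "VAR2", "VAR3")], na.rm = TRUE)

DATA
```

Whether averaging over partial responses is appropriate depends on the scale, so this is a design choice rather than a default.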

Converting Variable Types in R

R restricts the functions you can perform on variables based on the type of variable. Mathematical operations cannot be done on variables that are not numeric (e.g., "character" type variables). At times, it is necessary to convert a variable (VAR) that is the wrong type into the correct type (e.g., from "character" to "numeric"). The inner as.character call matters when VAR is a factor: without it, as.numeric returns the internal level codes rather than the original values. The conversion can be done with the following command:

DATA$VAR <- as.numeric(as.character(DATA$VAR))
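The reason for wrapping the variable in as.character() shows up when the variable is a factor. The sketch below uses a made-up factor (income) to compare the two conversions.

```r
# Hypothetical variable that was read in as a factor instead of as numbers
income <- factor(c("30", "10", "20"))

# Converting the factor directly returns the internal level codes
codes_only <- as.numeric(income)

# Converting to character first recovers the original values
true_values <- as.numeric(as.character(income))

codes_only    # 3 1 2
true_values   # 30 10 20
```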

Recoding Variables in R

It's not uncommon in the social sciences to need to recode a variable (VAR). For instance, you might have a scale with 10 different questions and half of the questions are reverse coded (i.e., they are coded the opposite from the other variables; 1=5 and 5=1). In such a situation, before combining the items in the scale, it's necessary to reverse code the variable. Here's a way to do that in R.

First, install the package "car":

install.packages("car")

Next, load the "car" library:

library(car)

Then, recode the variable:

VARX <- recode(VAR, '1=5; 2=4; 3=3; 4=2; 5=1')

Options in recoding include an "else" statement, which is useful if, for instance, you only need to recode one or two values and all remaining values can be coded into a single different value:

VARX <- recode(VAR, '1=73; 2=25; else=99')
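For the specific case of reverse coding a numeric scale, plain arithmetic in base R is a possible alternative to the "car" package: subtract each value from the sum of the scale's minimum and maximum. This is a different technique from recode, sketched here with an invented 1 to 5 item (the names item and item_rev are hypothetical).

```r
# Hypothetical responses on a 1-5 scale, including one missing value
item <- c(1, 2, 3, 4, 5, NA)

# Reverse code: (minimum + maximum) - value, so 1 becomes 5, 2 becomes 4, and so on
scale_min <- 1
scale_max <- 5
item_rev <- (scale_min + scale_max) - item

item_rev   # 5 4 3 2 1 NA
```

Missing values stay missing, and the midpoint of the scale (3) is unchanged.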

Viewing Variable Labels in R (from SPSS)

SPSS datasets include labels for variable categories (e.g., for the variable "race," 1 might equal "black"). If you have opened the SPSS dataset (DATA) with the "haven" library, you can view the labels for the variable "VAR" with the following command:

attributes(DATA$VAR)

You should then see a list of the codes for each label followed by the label itself.

The following command is helpful as well:

str(DATA$VAR)

Two additional commands can be helpful; both come from the "sjlabelled" package, which must be installed and loaded first. First, to get the variable label (so, the description of the variable):

get_label(DATA$VAR)

To get the value labels (in a different format from above), you can use the following command:

get_labels(DATA$VAR)
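As a rough illustration of what these commands read, the sketch below mimics in base R how labels from an SPSS file are commonly stored after import: a "label" attribute holding the variable description and a "labels" attribute holding the value labels. The variable race and its labels are invented, and a real imported column also carries a special class, so treat this as an assumption-laden sketch rather than the exact structure you will see.

```r
# Hypothetical coded variable; attributes are set by hand to imitate an imported SPSS column
race <- c(1, 2, 2, 1)
attr(race, "label")  <- "Respondent's race"
attr(race, "labels") <- c(black = 1, white = 2)

attr(race, "label")    # the variable label (description)
attr(race, "labels")   # the value labels: names are labels, values are codes

# Translate each code into its label by matching codes against the "labels" attribute
value_labels <- attr(race, "labels")
race_text <- names(value_labels)[match(race, value_labels)]
race_text   # "black" "white" "white" "black"
```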

Creating Tables and Proportion Tables in R

A simple way to make a frequency distribution table in R is to use the "table" and "prop.table" commands. The first command counts how many cases fall into each category of your variable. By storing that table, it is then possible to have R convert the counts into proportions, which gives a relative frequency distribution. This works well with nominal/ordinal variables.

table_VAR <- table(VAR)
prop.table(table_VAR)
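Here is a small worked sketch of the two commands, using an invented nominal variable (race) with ten cases. Multiplying the proportions by 100 gives percentages.

```r
# Hypothetical nominal variable with ten cases
race <- c("white", "black", "white", "asian", "white",
          "black", "white", "asian", "white", "black")

# Counts per category (categories are listed alphabetically)
table_race <- table(race)
table_race

# Proportions per category; they sum to 1
prop_race <- prop.table(table_race)
prop_race

# Percentages, rounded to one decimal place
round(100 * prop_race, 1)
```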

The same commands can be used for interval/ratio variables, but the table can get unwieldy if there are a lot of values. An easier approach is to organize the variable into groups. Here's how you might do this with a variable AGE that ranges from 18 to 89.

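breaks <- seq(18, 98, by = 10)
# this creates breaks at 18, 28, ..., 98; the last break must lie above the
# maximum age (89), otherwise the oldest cases fall outside the bins and become NA
# now apply those to your variable; right=FALSE makes the bins [18,28), [28,38), ...
age.cut <- cut(AGE, breaks, right=FALSE)
age.table <- table(age.cut)
age.table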
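When grouping a variable this way, it is worth checking that every case lands in a bin, because cut() returns NA for values outside the breaks. The sketch below uses invented ages and assumes breaks that extend to 98 so that the oldest respondents (88 and 89) are covered.

```r
# Hypothetical ages between 18 and 89
AGE <- c(18, 25, 28, 47, 63, 88, 89)

# Breaks at 18, 28, ..., 98 so the top bin [88,98) includes the maximum age
breaks <- seq(18, 98, by = 10)

# right = FALSE closes each bin on the left: [18,28), [28,38), ...
age.cut <- cut(AGE, breaks, right = FALSE)
age.table <- table(age.cut)
age.table

# Any NA here would mean a case fell outside the bins
sum(is.na(age.cut))

# Proportion of cases in each age group
prop.table(age.table)
```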

Alternatively, using the "Hmisc" package (installed and loaded in the same way as "car" above), you can get frequencies and proportions for a variable with the following command:

describe(DATASET$VARIABLE)

Calculating Quartiles and Percentiles in R

The following command will calculate quartiles from a variable:

quantile(VARIABLE, na.rm = TRUE)

By passing specific probabilities to the same command, you can also calculate any percentiles you need (here the 22nd, 45th, and 79th), as follows:

quantile(VARIABLE, c(.22, .45, .79), na.rm = TRUE)
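A short sketch with invented values shows what the quartile output looks like and why na.rm = TRUE is needed when the variable has missing values (without it, quantile() stops with an error if NAs are present).

```r
# Hypothetical variable: the values 1 through 9 plus one missing value
VARIABLE <- c(1:9, NA)

# Quartiles: the 0%, 25%, 50%, 75%, and 100% points
quartiles <- quantile(VARIABLE, na.rm = TRUE)
quartiles   # 1 3 5 7 9

# Specific percentiles are requested by passing probabilities between 0 and 1
quantile(VARIABLE, c(.22, .45, .79), na.rm = TRUE)
```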