Jump to content

Statistical Analysis: an Introduction using R/R/Missing data

From Wikibooks, open books for an open world
When collecting data, it is often the case that certain data points are unknown. This happens for a variety of reasons. For example, when analysing experimental data, we might be recording a number of variables for each experiment (e.g. temperature, time of day, etc.), yet have forgotten (or been unable) to record temperature in one instance. Or when collecting social data on US states, it might be that certain states do not record certain statistics of interest. Another example is the ship passenger data from the sinking of the Titanic, where careful research has identified the ticket class of all 2207 people on board, but not been able to ascertain the age of 10 or so of the victims (see http://www.encyclopedia-titanica.org). We could just omit missing data, but in many cases, we have information for some variables, but not for others. For example, we might not want to completely omit a US state from an analysis, just because it it missing one particular datum of interest. For this reason, R provides a special value, NA, meaning "not available". Any vector, numeric, character, or logical, can have elements which are NA. These can be identified by the function "is.na".
Input:
some.missing <- c(1,NA)
is.na(some.missing)
Result:
some.missing <- c(1,NA)

is.na(some.missing) [1] FALSE TRUE

Note that some analyses are hard to do if there are missing data. You can use "complete.cases" or "na.omit" to construct datasets with the missing values omitted.