Data is the life blood of statistical analysis. A recurring theme in this book is that most analysis consists of constructing sensible statistical models to explain the data that has been observed. This requires a clear understanding of the data and where it came from. It is therefore important to know the different types of data that are likely to be encountered. Thus in this chapter we focus on different types of data, including simple ways in which they can be examined, and how data can organised into coherent datasets.
R topics in Chapter 2
The R examples used in this chapter are intended to introduce you to the nuts and bolts of R, so may seem dry or even overly technical compared to the rest of the book. However, the topics introduced here are essential to understand how to use R and it is particularly important to understand them. They assume that you are comfortable with the concepts of assignment (i.e. storing objects) and functions, as detailed previously.
The simplest sort of data is just a collection of measurements, each measurement being a single "data point". In statistics, a collection of single measurements of the same sort is commonly known as a variable, and these are often given a name. Variables usually have a reasonable amount of background context associated with them: what the measurements represent, why and how they were collected, whether there are any known omissions or exceptional points, and so forth. Knowing or finding out this associated information is an essential part of any analysis, along with examination of the variables (e.g. by plotting or other means).
One of the most fundamental objects in R is the vector, used to store multiple measurements of the same type (e.g. data variables). There are several different sorts of data that can be stored in a vector. Most common is the numeric vector, in which each element of the vector is simply a number. Other commonly used types of vector are character vectors (where each element is a piece of text) and logical vectors (where each element is either TRUE or FALSE). In this topic we will use some example vectors provided by the "datasets" package, containing data on States of the USA (see ?state).
R is an inherently vector-based program; in fact the numbers we have been using in previous calculations are just treated as vectors with a single element. This means that most basic functions in R will behave sensibly when given a vector as a argument, as shown below.
state.area #a NUMERIC vector giving the area of US states, in square miles
state.name #a CHARACTER vector (note the quote marks) of state names
sq.km <- state.area*2.59 #Arithmetic works on numeric vectors, e.g. convert sq miles to sq km
sq.km #... the new vector has the calculation applied to each element in turn
sqrt(sq.km) #Many mathematical functions also apply to each element in turn
range(state.area) #But some functions return different length vectors (here, just the max & min).
length(state.area) #and some, like this useful one, just return a single value.
> state.area #a NUMERIC vector giving the area of US states, in square miles
 "Illinois" "Indiana" "Iowa" "Kansas"
 "Kentucky" "Louisiana" "Maine" "Maryland"
 "Massachusetts" "Michigan" "Minnesota" "Mississippi"
 "Missouri" "Montana" "Nebraska" "Nevada"
 "New Hampshire" "New Jersey" "New Mexico" "New York"
 "North Carolina" "North Dakota" "Ohio" "Oklahoma"
 "Oregon" "Pennsylvania" "The smallest state" "South Carolina"
 "South Dakota" "Tennessee" "Texas" "Utah"
 "Vermont" "Virginia" "Washington" "West Virginia"
 "Wisconsin" "Wyoming"
> sq.km <- state.area*2.59 #Standard arithmatic works on numeric vectors, e.g. convert sq miles to sq km
> sq.km #... giving another vector with the calculation performed on each element in turn
 323.45487 354.50609 293.30334 165.51263 146.23826 388.30328 466.62203 351.54579
 424.83731 617.32278 447.23364 535.06878 155.23324 142.46136 561.35100 358.33202
 369.04978 427.81111 326.74911 425.54695 501.17940 342.65503 56.07370 283.60615
 446.71213 330.77479 832.11058 468.96955 157.75712 325.13205 420.25859 250.25745
 381.36447 503.58441
> range(state.area) #But some functions return different length vectors (here, just the max & min).
 1214 589757
> length(state.area) #and some, like this useful one, just return a single value.
Note that the first part of your output may look slightly different to that above. Depending on the width of your screen, the number of elements printed on each line of output may differ. This is the reason for the numbers in square brackets, which are produced when vectors are printed to the screen. These bracketed numbers give the position of the first element on that line, which is a useful visual aid. For instance, looking at the printout of state.name, and counting across from the second line, we can tell that the eighth state is Delaware.
You may occasionally need to create your own vectors from scratch (although most vectors are obtained from processing data in already-existing files). The most commonly used function for constructing vectors is c(), so named because it concatenates objects together. However, if you wish to create vectors consisting of regular sequences of numbers (e.g. 2,4,6,8,10,12, or 1,1,2,2,1,1,2,2) there are several alternative functions you can use, including seq(), rep(), and the : operator.
c("one", "two", "three", "pi") #Make a character vector
c(1,2,3,pi) #Make a numeric vector
seq(1,3) #Create a sequence of numbers
1:3 #A shortcut for the same thing (but less flexible)
i <- 1:3 #You can store a vector
i <- c(i,pi) #To add more elements, you must assign again, e.g. using c()
i <- c(i, "text") #A vector cannot contain different data types, so ...
i #... R converts all elements to the same type
i+1 #The numbers are now strings of text: arithmetic is impossible
rep(1, 10) #The "rep" function repeats its first argument
rep(3:1,10) #The first argument can also be a vector
huge.vector <- 0:(10^7) #R can easily cope with very big vectors
#huge.vector #VERY BAD IDEA TO UNCOMMENT THIS, unless you want to print out 10 million numbers
rm(huge.vector) #"rm" removes objects. Deleting huge unused objects is sensible
> c("one", "two", "three", "pi") #Make a character vector
 "one" "two" "three" "pi"
> c(1,2,3,pi) #Make a numeric vector
 1.000000 2.000000 3.000000 3.141593
> seq(1,3) #Create a sequence of numbers
 1 2 3
> 1:3 #A shortcut for the same thing (but less flexible)
 1 2 3
> i <- 1:3 #You can store a vector
 1 2 3
> i <- c(i,pi) #To add more elements, you must assign again, e.g. using c()
 1.000000 2.000000 3.000000 3.141593
> i <- c(i, "text") #A vector cannot contain different data types, so ...
> i #... R converts all elements to the same type
 "1" "2" "3" "3.14159265358979" "text"
> i+1 #The numbers are now strings of text: arithmetic is impossible
Error in i + 1 : non-numeric argument to binary operator
> rep(1, 10) #The "rep" function repeats its first argument
 1 1 1 1 1 1 1 1 1 1
> rep(3:1,10) #The first argument can also be a vector
> huge.vector <- 0:(10^7) #R can easily cope with very big vectors
> #huge.vector #VERY BAD IDEA TO UNCOMMENT THIS, unless you want to print out 10 million numbers
> rm(huge.vector) #"rm" removes objects. Deleting huge unused objects is sensible
It is common to want to access certain elements of a vector: for example, we might want to use only the 10th element, or the first 4 elements, or select elements depending on their value. The way to do this is to take the vector and prepend the indexing operator,  (i.e. square brackets). If these square brackets contain
A positive number or numbers, then this has the effect of picking those particular elements of the vector
A negative number or numbers, then this has the effect of picking the whole vector except those elements
A logical vector, then each element of the logical vector indicates whether to pick (if TRUE) or not (if FALSE) the equivalent element of the original vector.
The use of logical vectors may seem a little complicated. However, they can be extremely useful, because they are the key behind using comparison operators. These can be used, for example, to identify which US states are small, with an area less than (<) 10 000 square miles (as demonstrated below).
min(state.area) #This gives the area of the smallest US state...
which.min(state.area) #... this shows which element it is (the 39th as it happens)
state.name #You can obtain individual elements by using square brackets
state.name <- "THE SMALLEST STATE" #You can replace elements using  too
state.name #The 39th name ("Rhode Island") should now have been changed
state.name[1:10] #This returns a new vector consisting of only the first 10 states
state.name[-(1:10)] #Using negative numbers gives everything but the first 10 states
state.name[c(1,2,2,1)] #You can also obtain the same element multiple times
###Logical vectors are a little more complicated to get your head round
state.area < 10000 #A LOGICAL vector, identifying which states are small
state.name[state.area < 10000] #So this can be used to select the names of the small states
> min(state.area) #This gives the area of the smallest US state...
> which.min(state.area) #... this shows which element it is (the 39th as it happens)
> state.name #You can obtain individual elements by using square brackets
 "Rhode Island"
> state.name <- "The smallest state" #You can replace elements using  too
> state.name #The 39th name ("Rhode Island") should now have been changed
 "Minnesota" "Mississippi" "Missouri" "Montana"
 "Nebraska" "Nevada" "New Hampshire" "New Jersey"
 "New Mexico" "New York" "North Carolina" "North Dakota"
 "Ohio" "Oklahoma" "Oregon" "Pennsylvania"
 "THE SMALLEST STATE" "South Carolina" "South Dakota" "Tennessee"
 "Texas" "Utah" "Vermont" "Virginia"
 "Washington" "West Virginia" "Wisconsin" "Wyoming"
> state.name[c(1,2,2,1)] #You can also obtain the same element multiple times
 "Alabama" "Alaska" "Alaska" "Alabama"
> ###Logical vectors are a little more complicated to get your head round
> state.area < 10000 #A LOGICAL vector, identifying which states are small
 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
 FALSE FALSE FALSE FALSE FALSE
> state.name[state.area < 10000] #So this can be used to select the names of the small states
 "Connecticut" "Delaware" "Hawaii" "Massachusetts"
 "New Hampshire" "New Jersey" "THE SMALLEST STATE" "Vermont"
Although the  operator can be used to access just a single element of a vector, it is particularly useful for accessing a number of elements at once. Another operator, the double square-bracket ([[) exists for specifically accessing a single element. While not particularly useful for vectors, it comes into its own for #Lists and #Data frames.
When accessing elements of vectors, we saw how to use a simple logical expression involving the less than sign (<) to produce a logical vector, which could then be used to select elements less than a certain value. This type of logical operation is very useful thing to be able to do. As well as <, there are a handful of other comparison operators. Here is the full set (See ?Comparison for more details)
< (less than) and <= (less than or equal to)
> (greater than) and >= (greater than or equal to)
Even more flexibility can be gained by combining logical vectors using and, or, and not. For example, we might want to identify which US states have an area less than 10 000 or greater than 100 000 square miles, or to identify which have an area greater than 100 000 square miles and which have a short name. The code below shows how can be used to do this, using the following R symbols:
When using logical vectors, the following functions are particularly useful, as illustrated below
which() identifies which elements of a logical vector are TRUE
sum() can be used to give the number of elements of a logical vector which are TRUE. This is because sum() forces its input to be converted to numbers, and if TRUE and FALSE are converted to numbers, they take the values 1 and 0 respectively.
ifelse() returns different values depending on whether each element of a logical vector is TRUE or FALSE. Specifically, a command such as ifelse(aLogicalVector, vectorT, vectorF) takes aLogicalVector and returns, for each element that is TRUE, the corresponding element from vectorT, and for each element that is FALSE, the corresponding element from vectorF. An extra elaboration is that if vectorT or vectorF are shorter than aLogicalVector they are extended by duplication to the correct length.
### In these examples, we'll reuse the American states data, especially the state names
### To remind yourself of them, you might want to look at the vector "state.names"
nchar(state.name) # nchar() returns the number of characters in strings of text ...
nchar(state.name) <= 6 #so this indicates which states have names of 6 letters or fewer
ShortName <- nchar(state.name) <= 6 #store this logical vector for future use
sum(ShortName) #With a logical vector, sum() tells us how many are TRUE (11 here)
which(ShortName) #These are the positions of the 11 elements which have short names
state.name[ShortName] #Use the index operator  on the original vector to get the names
state.abb[ShortName] #Or even on other vectors (e.g. the 2 letter state abbreviations)
isSmall <- state.area < 10000 #Store a logical vector indicating states <10000 sq. miles
isHuge <- state.area > 100000 #And another for states >100000 square miles in area
sum(isSmall) #there are 8 "small" states
sum(isHuge) #coincidentally, there are also 8 "huge" states
state.name[isSmall | isHuge] # | means OR. So these are states which are small OR huge
state.name[isHuge & ShortName] # & means AND. So these are huge AND with a short name
state.name[isHuge & !ShortName]# ! means NOT. So these are huge and with a longer name
### Examples of ifelse() ###
ifelse(ShortName, state.name, state.abb) #mix short names with abbreviations for long ones
# (think of this as "*if* ShortName is TRUE then use state.name *else* use state.abb)
### Many functions in R increase input vectors to the correct size by duplication ###
ifelse(ShortName, state.name, "tooBIG") #A silly example: the 3rd argument is duplicated
size <- ifelse(isSmall, "small", "large") #A more useful example, for both 2nd & 3rd args
size #might be useful as an indicator variable?
ifelse(size=="large", ifelse(isHuge, "huge", "medium"), "small") #A more complex example
> ### In these examples, we'll reuse the American states data, especially the state names
> ### To remind yourself of them, you might want to look at the vector "state.names"
> nchar(state.name) # nchar() returns the number of characters in strings of text ...
> isSmall <- state.area < 10000 #Store a logical vector indicating states <10000 sq. miles
> isHuge <- state.area > 100000 #And another for states >100000 square miles in area
> sum(isSmall) #there are 8 "small" states
> sum(isHuge) #coincidentally, there are also 8 "huge" states
> state.name[isSmall | isHuge] # | means OR. So these are states which are small OR huge
 "New Hampshire" "New Jersey" "New Mexico" "Rhode Island" "Texas"
> state.name[isHuge & ShortName] # & means AND. So these are huge AND with a short name
 "Alaska" "Nevada" "Texas"
> state.name[isHuge & !ShortName]# ! means NOT. So these are huge and with a longer name
 "Arizona" "California" "Colorado" "Montana" "New Mexico"
> ### Examples of ifelse() ###
> ifelse(ShortName, state.name, state.abb) #mix short names with abbreviations for long ones
If you have done any computer programming, you may be more used to dealing with logic in the context of "if" statements. While R also has an if() statement, it is less useful when dealing with vectors. For example, the following R expression
if(aVariable == 0) then print("zero") else print("not zero")
expects aVariable to be a single number: it outputs "zero" if this number is 0, or "not zero" if it is a number other than zero. If aVariable is a vector of 2 values or more, only the first element counts: everything else is ignored. There are also logical operators which ignore everything but the first element of a vector: these are && for AND and || for OR.
When collecting data, it is often the case that certain data points are unknown. This happens for a variety of reasons. For example, when analysing experimental data, we might be recording a number of variables for each experiment (e.g. temperature, time of day, etc.), yet have forgotten (or been unable) to record temperature in one instance. Or when collecting social data on US states, it might be that certain states do not record certain statistics of interest. Another example is the ship passenger data from the sinking of the Titanic, where careful research has identified the ticket class of all 2207 people on board, but not been able to ascertain the age of 10 or so of the victims (see http://www.encyclopedia-titanica.org).
We could just omit missing data, but in many cases, we have information for some variables, but not for others. For example, we might not want to completely omit a US state from an analysis, just because it it missing one particular datum of interest. For this reason, R provides a special value, NA, meaning "not available". Any vector, numeric, character, or logical, can have elements which are NA. These can be identified by the function "is.na".
One important feature of any variable is the values it is allowed to have. For example, a variable such as gender can only take a limited number of values ('Male' and 'Female' in this instance), whereas a variable such as humanHeight can take any numerical value between about 0 and 3 metres. This is the sort of obvious background information that cannot necessarily be inferred from the data, but which can be vital for analysis. Only a limited amount of this information is usually fed directly into statistical analysis software. As always, it's very important to take account of such background information. This can be usually done using that commodity - unavailable to a computer - known as common sense. For example, a computer could be used to perform an analysis of human height without realising that one person has been recorded as (say) 175, rather than 1.75, metres tall. A computer can blindly perform analysis on this variable without noticing the error, even though it is glaringly obvious to a human. That's one of the primary reasons that it is important to plot data before analysis.
Nevertheless, a few bits of information about a variable can (indeed, often must) be given to analysis software. Nearly all statistical software packages require you, at a minimum, to distinguish between categorical variables (e.g. gender) in which each data point takes one of a fixed number of pre-defined "levels", and quantitative variables (e.g. humanHeight) in which each data point is a number on a well-defined scale. Further examples are given in Table 2.1. This distinction is important even for such simple analyses as taking an average: a procedure which is meaningful for a quantitative variable, but rarely for a categorical variable (what is the "average" sex of a 'male' and a 'female'?).
Table 2.1 Examples of types of variables. Many of these variables are used later in this book.
Categorical (also known as "qualitative" variables or "factors")
The variable gender where each data point is the gender of a human (i.e. the levels are 'Male' or 'Female')
The variable class, where each data point is the class of a passenger on a ship (the levels being '1st', '2nd', or '3rd')
Quantitative (also known as "numeric" variables or "covariates")
The variable weightChange, where each point is the weight change of an anorexic patient over a fixed period.
The variable landArea, where each data point is a positive number giving the area of a piece of land.
The variable shrimp, where each data point is the determination of the percentage of shrimp in a preparation of shrimp cocktail (!)
The variable deathsPerYear, where data point is a count of a number of individuals who have died from a particular cause in a particular year.
The variable cases, where each data point is a count of the number of people in a particular group who have a specific disease. This may be meaningless unless we also have some measure of the group size. This can be done by via another variable (often labelled controls) indicating the number of people in each group who have not developed the disease (grouped case/control studies of this sort are common in areas such as medicine).
The variable temperatureF, where each data point is an average monthly temperature in degrees Fahrenheit.
The variable direction, where each data point is a compass direction in degrees.
It is not always immediately obvious from the plain data whether a variable is categorical or quantitative: often this judgement must be made by careful consideration of the context of the data. For example, a variable containing numbers 1, 2, and 3 might seem to be a numerical quantity, but it could just as easily be a categorical variable describing (say) a medical treatment using either drug 1, drug 2, or drug 3. More rarely, a seemingly categorical variable such as colour (levels 'blue', 'green', 'yellow', 'red') might be better represented as a numerical quantity such as the wavelength of light emitted in an experiment. Again, it's your job to make this sort of judgement, on the basis of what you are trying to do.
Despite the importance of the categorical/quantitative distinction (and its prominence in many textbooks), reality is not always so clear-cut. It can sometimes be reasonable to treat categorical variables as quantitative, or vice versa. Perhaps the commonest case is when the levels of a categorical variable seem to have a natural order, such as the class variable in Table 2.1, or the Likert scale often used in questionnaires.
In rare and specific circumstances, and depending on the nature of the question being asked, there may be rough numerical values that can be allocated to each level. For example, maybe a survey question is accompanied by a visual scale on which the Likert categories are marked, from 'absolutely agree' to 'absolutely disagree'. In this case it may be justifiable to convert the categorical variable straight into to a quantitative one.
More commonly, the order of levels is known, but exact values cannot be generally agreed upon. Such categorical variables can be described as ordinal or ranked, as opposed ones such as gender or professedReligion which are purely nominal. Hence we can recognise two major types of categorical variable: ordered ("ordinal") and unordered ("nominal"), as illustrated by the two examples in Table 2.1.
Although the categorical/quantitative division is the most important one, we can further subdivide each type (as we have already seen when discussing categorical variables) . The most commonly taught classification is due to Stevens (1946). As well as dividing categorical variables into ordinal and nominal types, he classified quantitative variables into two types, interval or ratio, depending on the nature of the scale that was used. To this classification can be added circular variables. Hence classifying quantitative variables on the basis of the measurement scale leads to three subdivision (as illustrated by the subdivisions in Table 2.1):
Ratio data is the most commonly encountered. Examples include distances, lengths of time, numbers of items, etc. These variables are measured on a scale with a natural zero point. Because we can
Interval data is measured on a scale where there is no natural zero point. The most common examples are temperature (in degrees Centigrade or Fahrenheit) and calendar date. Since the zero point on the scale is essentially arbitrary, The name comes from the fact that while ratios are not meaningful, intervals are. E.g. means that ****
Circular data is measured on a scale which "wraps around", such as Direction, TimeOfDay, Logitude etc. ***
The Stevens classification is not the only way to categorise quantitative variables. Another sensible division recognises the difference between continuous and discrete measurements. Specifically, quantitative variables can represent either
Continuous data, in which it makes sense to talk about intermediate values (e.g. 1.2 hours, 12.5%, etc.). This includes cases where data have been rounded ***.
"Discrete data", where intermediate values are nonsensical (e.g. doesn't make much sense to talk about 1.2 deaths, or 2.6 cancer cases in a group of 10 people). Often these are counts of things: this is sometime known as meristic data.
In practice, discrete data are often treated as continuous, especially when the units into which they are divided are relatively small. For example, the population size of different countries is theoretically discrete (you can't have half a person), but the values are so huge that it may be reasonable to treat such data as continuous. However, for small values, such as the number of people in a household, the data are rather "granular", and the discrete nature of values becomes very apparent. One common result of this is the presence of multiple repeated values (e.g. there will be a lot of 2 person households in most data sets).
A third way of classifying quantitative variables depends on whether the scale has upper or lower bounds, or even both.
bounded at one end (e.g. landArea cannot be below 0),
at both ends (e.g. percentages cannot be less then 0 or greater than 100). Also see censored data ***
Most important is circular - often requires very different analytical tools. Often best to make linear in some way (e.g. difference from a fixed direction).
Interval data cannot use ratios (division). Rather rare
Bounds: very common to have lower bound. Unusual to have only an upper bound. Both often indicates a percentage. - often treated by transformation (e.g. log)
Count data: if multiple identical values, can affect plotting etc. If true independent counts, indicates error function.
The distinctions between the different types of variables are summarised in Figure ***. Note that it is also common to
Categorical variables in R are stored as a special vector object known as a factor. This is not the same as a character vector filled with a set of names (don't get the two mixed up). In particular, R has to be told that each element can only be one of a number of known levels (e.g. Male or Female). If you try to place a data point with a different, unknown level into the factor, R will complain. When you print a factor to the screen, R will also list the possible levels that factor can take (this may include ones that aren't present)
The factor() function creates a factor and defines the available levels. By default the levels are taken from the ones in the vector***. Actually, you don't often need to use factor(), because when reading data in from a file, R assumes by default that text should be converted to factors (see Statistical Analysis: an Introduction using R/R/Data frames). You may need to use as.factor(). Internally, R stores the levels as numbers from 1 upwards, but it is not always obvious which number corresponds to which level, and it should not normally be necessary to know.
Ordinal variables, that is factors in which the levels have a natural order, are known to R as ordered factors. They can be created in the normal way a factor is created, but in addition specifying ordered=TRUE.
state.region #An example of a factor: note that the levels are printed out
state.name #this is *NOT* a factor
state.name <- "Any text" #you can replace text in a character vector
state.region <- "Any text" #but you can't in a factor
state.region <- "South" #this is OK
state.abb #this is not a factor, just a character vector
character.vector <- c("Female", "Female", "Male", "Male", "Male", "Female", "Female", "Male", "Male", "Male", "Male", "Male", "Female", "Female" , "Male", "Female", "Female", "Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female", "Male", "Male", "Male", "Female" , "Male", "Female", "Male", "Male", "Male", "Male", "Male", "Female", "Male", "Male", "Male", "Male", "Female", "Female", "Female") #a bit tedious to do all that typing
#might be easier to use codes, e.g. 1 for female and 2 for male
Coded <- factor(c(1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1))
Gender <- factor(Coded, labels=c("Female", "Male")) #we can then convert this to named levels
Much statistical theory uses matrix algebra. While this book does not require a detailed understanding of matrices, it is useful to know a little about how R handles them.
Essentially, a matrix (plural: matrices) is the two dimensional equivalent of a vector. In other words, it is a rectangular grid of numbers, arranged in rows and columns. In R, a matrix object can be created by the matrix() function, which takes, as a first argument, a vector of numbers with which the matrix is filled, and as the second and third arguments, the number of rows and the number of columns respectively.
R can also use array objects, which are like matrices, but can have more than 2 dimensions. These are particularly useful for tables: a type of array containing counts of data classified according to various criteria. Examples of these "contingency tables" are the HairEyeColor and Titanic tables shown below.
As with vectors, the indexing operator  can be used to access individual elements or sets of elements in a matrix or array. This is done by separating the numbers inside the brackets by commas. For example, for matrices, you need to specify the row index then a comma, then the column index. If the row index is blank, it is assumed that you want all the rows, and similarly for the columns.
m <- matrix(1:12, 3, 4) #Create a 3x4 matrix filled with numbers 1 to 12
m #Display it!
m*2 #Arithmetic, just like with vectors
m[2,3] #Pick out a single element (2nd row, 3rd column)
m[1:2, 2:4] #Or a range (rows 1 and 2, columns 2, 3, and 4.)
m[,1] #If the row index is missing, assume all rows
m[1,] #Same for columns
m[,2] <- 99 #You can assign values to one or more elements
###Some real data, stored as "arrays"
HairEyeColor #A 3D array
HairEyeColor[,,1] #Select only the males to make it a 2D matrix
Titanic #A 4D array
Titanic[1:3,"Male","Adult",] #A matrix of only the adult male passengers
> m <- matrix(1:12, 3, 4) #Create a 3x4 matrix filled with numbers 1 to 12
> m #Display it!
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
> m*2 #Arithmetic, just like with vectors
[,1] [,2] [,3] [,4]
[1,] 2 8 14 20
[2,] 4 10 16 22
[3,] 6 12 18 24
> m[2,3] #Pick out a single element (2nd row, 3rd column)
> m[1:2, 2:4] #Or a range (rows 1 and 2, columns 2, 3, and 4.)
[,1] [,2] [,3]
[1,] 4 7 10
[2,] 5 8 11
> m[,1] #If the row index is missing, assume all rows
 1 2 3
> m[1,] #Same for columns
 1 4 7 10
> m[,2] <- 99 #You can assign values to one or more elements
> m #See!
[,1] [,2] [,3] [,4]
[1,] 1 99 7 10
[2,] 2 99 8 11
[3,] 3 99 9 12
> ###Some real data, stored as "arrays"
> HairEyeColor #A 3D array
, , Sex = Male
Hair Brown Blue Hazel Green
Black 32 11 10 3
Brown 53 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
, , Sex = Female
Hair Brown Blue Hazel Green
Black 36 9 5 2
Brown 66 34 29 14
Red 16 7 7 7
Blond 4 64 5 8
> HairEyeColor[,,1] #Select only the males to make it a 2D matrix
Hair Brown Blue Hazel Green
Black 32 11 10 3
Brown 53 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
> Titanic #A 4D array
, , Age = Child, Survived = No
Class Male Female
1st 0 0
2nd 0 0
3rd 35 17
Crew 0 0
, , Age = Adult, Survived = No
Class Male Female
1st 118 4
2nd 154 13
3rd 387 89
Crew 670 3
, , Age = Child, Survived = Yes
Class Male Female
1st 5 1
2nd 11 13
3rd 13 14
Crew 0 0
, , Age = Adult, Survived = Yes
Class Male Female
1st 57 140
2nd 14 80
3rd 75 76
Crew 192 20
> Titanic[1:3,"Male","Adult",] #A matrix of only the adult male passengers
Before carrying out a formal analysis, you should always perform an Initial Data Analysis, part of which is to inspect the variables that are to be analysed. If there are only a few data points, the numbers can be scanned by eye, but normally it is easier to inspect data by plotting.
Scatter plots, such as those in Chapter 1, are perhaps the most familiar sort of plot, and are useful for showing patterns of association between two variables. These are discussed later in this chapter, but in this section we first examine the various ways of visualising a single variable.
Plots of a single variable, or univariate plots are particularly used to explore the distribution of a variable; that is its shape and position. Apart from initial inspection of data, one very common use of these plots is to look at the residuals (see Figure 1.2): the unexplained part of the data that remains after fitting a statistical model. Assumptions about the distribution of these residuals are often checked by plotting them.
The plots which follow illustrate a few of the more common types of univariate plot. The classic text is Tufte (cite: the visual display of quantitative information).
For categorical variables, the choice of plots is quite simple. The most basic plots simply involve counting up the data points at each level.
Figure 2.1: Categorical data plots
(a) A simple bar chart, giving number of data points in different categories
(b) Alternative, for highly ordered data: symbols mark number of data points, lines imply intermediate values are conceptually possible
Figure 2.1(a) displays these counts as a bar chart; another possibility is to use points as in Figure 2.1(b). In the case of gender, the order of the levels is not important: either 'male' or 'female' could come first. In the case of class, the natural order of the levels is used in the plot. In the extreme case, where intermediate levels might be meaningful, or where you wish to emphasise the pattern between levels, it may be reasonable to connect points by lines. For illustrative purposes, this has been done in Figure 2.1(b), although the reader may question whether it is appropriate in this instance.
In some cases we may be interested in the actual sequence of data points. This is particularly so for time series data, but may also be relevant elsewhere. For instance, in the case of gender, the data was recorded in the order that each child was born. If we think that the preceeding birth influences the following birth (unlikely in this case, but just within the realm of possibility if pheremones are involved), then we might want to do Symbol-by-Symbol plot . If we are looking for associations with time, however, then a bivariate plot may be more appropriate ***, or there are particular features of the data that we are interested in (e.g. repeat rate), then other possibilities exist (doi:10.1016/j.stamet.2007.05.001).See chapter on time series
Table 2.2: Land area (km2) of the 50 states of the USA
Table 2.3: A classic data set of the number of accidental deaths by horse-kick in 14 cavalry corps of the Prussian army from 1875-1894
A quantitative variable can be plotted in many more ways than a categorical variable. Some of the most common single-variable plots are discussed below, using the land area of the 50 US states as our example of a continuous variable, and a famous data set of the number of deaths by horse kick as our example of a discrete variable. These data are tabulated in Tables 2.2 and 2.3
Some sorts of data consist of many data points with identical values. This is particularly true for count data where there are low numbers of counts (e.g. number of offspring).
There are 3 things we might want to look for in these sorts of plots
points that seem extreme in some way (these are known as outliers). Outliers often reveal mistakes in data collection, and even if they don't, they can have a disproportionate effect on further analysis. If it turns out that they aren't due to an obvious mistake, one option is to remove them from the analysis, but this causes problems of its own/
shape & position of the distribution (e.g. normal, bimodal, etc.)
similarity to known distributions (QQ)
We'll keep the focus mostly on the variable "landArea" ***
Figure 2.3: Basic plots for continuous quantitative variables
(a) 1D scatterplot
(b) 1D scatterplot (jittered)
(b) Colours illustrate upper (red) and lower (green) quartiles
(c) Basic boxplot. The box displays the median and upper and lower quartiles, while the "whiskers" show the full range (from the minimum to the maximum) of the data.
(c) Boxplot coloured to correspond to Figure 2.3b
(d) Sophisticated boxplot. Outliers have been excluded from the whiskers, and plotted separately. The notch in the box gives an idea of our confidence in the value of the median.
The simplest way to represent quantitative data is to plot the points on a line, as in Figure 2.3(a). This is often called a 'dot plot, although this is also sometimes used to describe a number of other types of plot (e.g. Figure 2.7). To avoid confusion, it may be best to call it a one-dimensional scatterplot. As well as simplicity, there are two advantages to a 1D scatterplot
All the information present in the data is retained.
Outliers are easily identified. Indeed, it is often useful to be able to identify which data points are outliers. Some software packages allow you to identify points interactively (e.g. by clicking points on the plot). Otherwise, points can be labelled, as has been done for some outlying points in Figure 2.3a.
One dimensional scatterplots do not work so well for large datasets. Figure 2.3(a) consists of only 50 points. Even so, it is difficult to get an overall impression of the data, to (as the saying goes) "see the wood for the trees". This is partly because some points lie almost on top of each other, but also because of the sheer number of closely placed points. It is often the case that features of the data are best explored by summarising it in some way.
Figure 2.3(b) shows a step on the way to a better plot. To alleviate the problem of points obscuring each other, they have been displaced, or jittered sideways by a small, random amount. More importantly, the data have been summarised by dividing into quartiles (and coloured accordingly, for ease of explanation). The quarter of states with the largest area have been coloured red. The smallest quarter of states have been coloured green.
More generally, we can talk about the quantiles of our data. The red line represents the 75% quantile: 75% of the points lie below it. The green line represents the 25% quantile: 25% of the points lie below it. The distance between these lines is known as the Interquartile Range (IQR), and is a measure of the spread of the data. The thick black line has a special name: the median. It marks the middle of the data, the 50% quantile: 50% of the points lie above, and 50% lie below. A major advantage of quantiles over other summary measures is that they are relatively insensitive to outliers, or changes in scale ****.
Figure 2.3(c) is a coloured version of a widely used statistical summary plot: the boxplot. Here it has been coloured to show the correspondence to Figure 2.3(b). The box marks the quartiles of the data, with the median marked within the box. If the median is not positioned centrally within the box, this is often an indicator that the data are skewed in some way. The lines on either side of the box are known as "whiskers", and summarise the data which lies outside of the upper and lower quartiles. In this case, the whiskers have simply been extended to the maximum and minimum observed values.
Figure 2.3(d) is a more sophisticated boxplot of the same data. Here, notches have been drawn on the box: these are useful for comparing the medians in different boxplots. The whiskers have been shortened so that they do not include points considered as outliers. There are various ways of defining these outliers automatically. This figure is based on a convention that considers outlying points as those more than one and a half times the IQR from either side of the box. However it is often more informative to identify and inspect interesting points (including outliers) by visual inspection. For example, in Figure 2.1a it is clear that Alaska and (to a lesser extent) Texas are unusually large states, but that California (identified as an outlier by this automatic procedure) is not so set-apart from the rest.
Figure 2.4: Basic plots for discrete quantitative variables
(b) Stacked (or see line equivalent in fig 2.5a)
(c) Boxplots are also reasonable for discrete data
(d) Points merged. Here the area (not width or height) of each point is relative to the number of plotting points at each location. It is more difficult to judge the exact numbers, and so is more useful in plots such as Snow's modified cholera plot (Chapter 1), where locations are given more weight depending on the number of points at each ***
One problem with plotting on a single line is that, if points are repeated, there ***. This is particularly problematic for discrete data. NB, there is no particular reason (or established convention) for these plots to be vertical. Figure 2.2 shows . The stacked plot (Figure 2.4d) is similar to a histogram (Figure 2.5).
Figure 2.5: Histograms
(a) Histogram-like line plot (compare to Fig 2.4b)
(b) As a histogram, years merged in 2s
(b) Colours illustrate upper (red) and lower (green) quartiles
(c) Histogram for continuous data
(c) Actual points marked
(d) Smaller bin widths
(d) Actual points marked
This gives another way of picturing the median & other quantiles: as dividing the area into sections ***
Figure 2.6: Density plots
(a) Kernel density plot
(b) Kernel density plot (can be misleading for e.g. uniform distributions)
We can space out the points along the other axis. For example, if the order of points in the dataset is meaningful, we can just plot each point in turn. This is true for whatsit's horse-kick data. The data by year are plotted in Figure 2.6.
Figure 2.7: Ordering data points
(a) Simple time series plot (i.e. sorted by date/time)
(b) Quantile plot (the same points ordered by size)
(b) Quartiles illustrated with corresponding boxplot
(c) Empirical cumulative distribution function (as (b) but with axes swapped and rescaled)
One thing we can always do is to sort the data points by their value, plotting the smallest first, etc. This is seen in Figure 2.3b. If all the data points were equally spaced (and excluded each other****), we would see a straight line. The plot for the logged variables shows that this transformation has evened out the spacing somewhat. This is called a quantile plot, for the following reason
when the axes are swapped, this is called the empirical cumulative distribution function. The unswapped data is useful for understanding qq plots. Also for understanding quantiles. median, etc.
We could put a scale break in, but a better option is usually to transform the variable.
Figure 2.8: The effect of log transformation
Sometimes, plotting on a different scale (e.g. a logarithmic scale) can be more informative. We can visualise this either as a plot with a non-linear (e.g. logarithmic) axis, or as a conventional plot of a transformed variable (e.g. a plot of log(my.variable) on a standard, linear axis). Figure 2.1(b) illustrates this point: the left axis marks the state areas, the right axis marks the logarithm of the state areas. This sort of rescaling can highlight quite different features of a variable. In this case, it seems clear that there are a batch of nine states which seem distinctly smaller than most, and while Alaska still seems extraordinarily large, Texas does not seem so unusual in that respect. This is also reflected by the automatic labelling of outliers in the log-transformed variable.
It is particularly common for smaller numbers to have greater resolution. As discussed in later chapters ***, logarithmic scales are particularly useful for multiplicative data ***.
Figure 2.8: The effect of square root transformation
There are other common transformations, for example, the square-root transformation (often used for count data). This may be more appropriate for state areas, if the limiting factor for state size is (e.g.) the distance across the state, or factors associated with it (e.g. the length of time to cross from one side to another). Figure 2.1c shows a sqrt rescaling of the data. You can see that in some sense this is less extreme than the log transformation...
↑The convention (which is followed here) is to write variable names in italics
↑These are special words in R, and cannot be used as names for objects. The objects T and F are temporary shortcuts for TRUE and FALSE, but if you use them, watch out: since T and F are just normal object names you can change their meaning by overwriting them.
↑if the logical vector is shorter then the original vector, then it is sequentially repeated until it is of the right length
↑Note that, when using continuous (fractional) numbers, rounding error may mean that results of calculations are not exactly equal to each other, even if they seem as if they should be. For this reason, you should be careful when using == with continuous numbers. R provides the function all.equal to help in this case
↑There are actually 3 types of allowed numbers: "normal" numbers, complex numbers, and simple integers. This book deals almost exclusively with the first of these.
↑This is not quite true, but unless you are a computer specialist, you are unlikely to use the final type: a vectors of elements storing "raw" computer bits, see ?raw
↑This dataset was collected by the Russian economist von Bortkiewicz, in 1898, to illustrate the pattern seen when events occur independently of each other (this is known as a Poisson distribution). The table here gives the total number of deaths summed over all 14 corps. For the full dataset, broken down into corp, see Statistical Analysis: an Introduction using R/Datasets
↑Authors such as Wild & Seber call it a dot plot, but R uses the term "dotplot" to refer to a Cleveland (1985) dot-plot as shown in Figure 2.1(b). Other authors (***) specifically use it to refer to a sorted ("quantile") plot as used in Figure *** (cite)
↑Labels often obscure the plot, so For plots intended only for viewing on a computer, it is possible to print labels in a size so small that they can only be seen when zooming in.
↑There a many different methods for calculating the precise value of a quantile, when it lies between two points. See Hyndman and Fan (1996)
Carifio, J. & Perla, R. (2007). Ten Common Misunderstandings, Misconceptions, Persistent Myths and Urban Legends about Likert Scales and Likert Response Formats and their Antidotes. Journal of Social Sciences, 2, 106-116. http://www.scipub.org/fulltext/jss/jss33106-116.pdf