# Statistical Analysis: an Introduction using R/R/R fundamentals

## R fundamentals

If you carry out the exercises in all these topics, you should become reasonably competent in using R (see also programming).

Text marked like this is used to discuss an R-specific point. The basics of R can be learned by reading these sections in the order they appear in the book. There will also be commands that can be entered directly into R; you should be able to copy and paste them into your R session. Try the following to see how to use R as a simple calculator:
###### Input:
```
100+2/3
```
###### Result:
```
> 100+2/3
[1] 100.6667
```
In the absence of any instructions about what to do with the output of a command, R usually prints the result to the screen. For the time being, ignore the `[1]` before the answer: we will see that this is useful when R outputs many numbers at once. Note that R respects the standard mathematical rules of carrying out multiplication and division before addition and subtraction: it divides 2 by 3 before adding 100.
R commands can sometimes be rather difficult to follow, so occasionally it can be useful to annotate them with comments. This can be done by typing a hash (#) character: any further text on the same line is ignored by R. This will be used extensively in the R examples in this wikibook, e.g.
###### Input:
```
#this is a comment: R will ignore it
(100+2)/3    #You can use round brackets to group operations so that they are carried out first
5*10^2       #The symbol * means multiply, and ^ means "to the power", so this gives 5 times (10 squared), i.e. 500
1/0          #R knows about infinity (and minus infinity)
0/0          #undefined results take the value NaN ("not a number")
(0i-9)^(1/2) #for the mathematically inclined, you can force R to use complex numbers
```
###### Result:
```
> #this is a comment: R will ignore it
> (100+2)/3    #You can use round brackets to group operations so that they are carried out first
[1] 34
> 5*10^2       #The symbol * means multiply, and ^ means "to the power", so this gives 5 times (10 squared), i.e. 500
[1] 500
> 1/0          #R knows about infinity (and minus infinity)
[1] Inf
> 0/0          #undefined results take the value NaN ("not a number")
[1] NaN
> (0i-9)^(1/2) #for the mathematically inclined, you can force R to use complex numbers
[1] 0+3i
```
• If you don't know anything about complex numbers, don't worry: they are not important here.
• Note that only round brackets should be used to group operations together: curly brackets {} and square brackets [] have other meanings in R.
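
As a small illustrative sketch (the particular expressions are chosen here for illustration and are not taken from the examples above), the following lines show how round brackets change the order in which R carries out operations, including the slightly surprising case of a minus sign combined with `^`.

```r
# Illustrative expressions chosen for this sketch
100+2/3      # division first, then addition
(100+2)/3    # round brackets force the addition to happen first
-2^2         # "^" is applied before the minus sign, so this is -(2^2), i.e. -4
(-2)^2       # brackets make R square minus two instead, giving 4
2^3^2        # repeated "^" is evaluated right to left: 2^(3^2), i.e. 512
```

When in doubt about the order of evaluation, adding an extra pair of round brackets costs nothing and makes the intention clear.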

R is what is known as an "object-oriented" program. Everything (including the numbers you have just typed) is a type of object. Later we will see why this concept is so useful. For the time being, you need only note that you can give a name to an object, which has the effect of storing it for later use. Names can be assigned by using the arrow-like signs `<-` and `->` as demonstrated in the exercise below. Which sign you use depends on whether you prefer putting the name first or last (it may be helpful to think of `->` as "put into" and `<-` as "set to").

Unlike many statistical packages, R does not usually display the results of analyses you perform. Instead, analyses usually end up by producing an object which can be stored. Results can then be obtained from the object at leisure. For this reason, when doing statistics in R, you will often find yourself naming and storing objects. The name you choose should consist of letters, numbers, and the "." character, and should not start with a number.

###### Input:
```
0.001 -> small.num                #Store the number 0.001 under the name "small.num" (i.e. put 0.001 into small.num)
big.num <- 10 * 100               #You can put the name first if you reverse the arrow (set big.num to 1000).
big.num+small.num+1               #Now you can treat big.num and small.num as numbers, and use them in calculations
my.result <- big.num+small.num+2  #And you can store the result of any calculation
my.result                         #To look at the stored object, just type its name
pi                                #There are some named objects that R provides for you
```
###### Result:
```
> 0.001 -> small.num                #Store the number 0.001 under the name "small.num" (i.e. put 0.001 into small.num)
> big.num <- 10 * 100               #You can put the name first if you reverse the arrow (set big.num to 1000).
> big.num+small.num+1               #Now you can treat big.num and small.num as numbers, and use them in calculations
[1] 1001.001
> my.result <- big.num+small.num+2  #And you can store the result of any calculation
> my.result                         #To look at the stored object, just type its name
[1] 1002.001
> pi                                #There are some named objects that R provides for you
[1] 3.141593
```
Note that when the end result of a command is to store (assign) an object, as on input lines 1, 2, and 4, R doesn't print anything to the screen.
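
If you want to store a result and see it at the same time, one common idiom (not covered in the text above, so treat this as a hedged aside) is to wrap the whole assignment in round brackets. The name `half.way` below is hypothetical, chosen only for this sketch.

```r
# "half.way" is a made-up name used only for illustration
half.way <- (1 + 100) / 2     # stored silently: nothing is printed
(half.way <- (1 + 100) / 2)   # the outer brackets make R print the stored value too
half.way * 2                  # the stored object can be reused as usual
```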

Apart from numbers, perhaps the most useful named objects in R are functions. Nearly everything useful that you will do in R is carried out using a function, and many are available in R by default. You can use (or "call") a function by typing its name followed by a pair of round brackets. For instance, the start up text mentions the following function, which you might find useful if you want to reference R in published work:
###### Input:
```
citation()
```
###### Result:
```
> citation()

To cite R in publications use:

R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical
Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

A BibTeX entry for LaTeX users is

@Manual{,
url = {http://www.R-project.org},
title = {R: A Language and Environment for Statistical Computing},
author = {{R Development Core Team}},
organization = {R Foundation for Statistical Computing},
year = {2008},
note = {{ISBN} 3-900051-07-0},
}

We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also
‘citation("pkgname")’ for citing R packages.
```
Many R functions can produce results which differ depending on arguments that you provide to them. Arguments are placed inside the round brackets, separated by commas. Many functions have one or more optional arguments: that is, you can choose whether or not to provide them. An example of this is the `citation()` function. It can take an optional argument giving the name of an R add-on package. If you do not provide an optional argument, there is usually an assumed default value (in the case of `citation()`, this default value is `"base"`, i.e. provide the citation reference for the base package: the package which provides most of the foundations of the R language).

Most arguments to a function are named. For example, the first argument of the `citation()` function is named `package`. To provide extra clarity, when using a function you can provide arguments in the longer form `name=value`. Thus

```
citation("base")
```

does the same as

```
citation(package="base")
```

If a function can take more than one argument, using the long form also allows you to change the order of arguments, as shown in the example code below.

###### Input:
```
citation("base")      #Does the same as citation(), because the default for the first argument is "base"
                      #Note: quotation marks are needed in this particular case (see discussion below)
citation("datasets")  #Find the citation for another package (in this case, the result is very similar)
sqrt(25)              #A different function: "sqrt" takes a single argument, returning its square root.
sqrt(25-9)            #An argument can contain arithmetic and so forth
sqrt(25-9)+100        #The result of a function can be used as part of a further analysis
max(-10, 0.2, 4.5)    #This function returns the maximum value of all its arguments
sqrt(2 * max(-10, 0.2, 4.5))             #You can use results of functions as arguments to other functions
x <- sqrt(2 * max(-10, 0.2, 4.5)) + 100  #... and you can store the results of any of these calculations
x
log(100)              #This function returns the logarithm of its first argument
log(2.718282)         #By default this is the natural logarithm (base "e")
log(100, base=10)     #But you can change the base of the logarithm using the "base" argument
log(100, 10)          #This does the same, because "base" is the second argument of the log function
log(base=10, 100)     #To have the base as the first argument, you have to use the form name=value
```
###### Result:
```
> citation("base")      #Does the same as citation(), because the default for the first argument is "base"

To cite R in publications use:

R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

A BibTeX entry for LaTeX users is

@Manual{,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Development Core Team}},
organization = {R Foundation for Statistical Computing},
year = {2008},
note = {{ISBN} 3-900051-07-0},
url = {http://www.R-project.org},
}

We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also
‘citation("pkgname")’ for citing R packages.

>                       #Note: quotation marks are needed in this particular case (see discussion below)
> citation("datasets")  #Find the citation for another package (in this case, the result is very similar)

The 'datasets' package is part of R.  To cite R in publications use:

R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

A BibTeX entry for LaTeX users is

@Manual{,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Development Core Team}},
organization = {R Foundation for Statistical Computing},
year = {2008},
note = {{ISBN} 3-900051-07-0},
url = {http://www.R-project.org},
}

We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also
‘citation("pkgname")’ for citing R packages.

> sqrt(25)              #A different function: "sqrt" takes a single argument, returning its square root.
[1] 5
> sqrt(25-9)            #An argument can contain arithmetic and so forth
[1] 4
> sqrt(25-9)+100        #The result of a function can be used as part of a further analysis
[1] 104
> max(-10, 0.2, 4.5)    #This function returns the maximum value of all its arguments
[1] 4.5
> sqrt(2 * max(-10, 0.2, 4.5))             #You can use results of functions as arguments to other functions
[1] 3
> x <- sqrt(2 * max(-10, 0.2, 4.5)) + 100  #... and you can store the results of any of these calculations
> x
[1] 103
> log(100)              #This function returns the logarithm of its first argument
[1] 4.60517
> log(2.718282)         #By default this is the natural logarithm (base "e")
[1] 1
> log(100, base=10)     #But you can change the base of the logarithm using the "base" argument
[1] 2
> log(100, 10)          #This does the same, because "base" is the second argument of the log function
[1] 2
> log(base=10, 100)     #To have the base as the first argument, you have to use the form name=value
[1] 2
```
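
To check the names and order of a function's arguments before relying on their positions, one option is the `args()` function. The sketch below is only an illustration: the object names `by.position` and `by.name` are hypothetical, and the choice of `log()` simply follows the examples above.

```r
# "by.position" and "by.name" are made-up names used only for illustration
args(log)                      # shows the argument names: x, then base (default exp(1))
by.position <- log(8, 2)       # arguments matched by position: x is 8, base is 2
by.name <- log(base=2, x=8)    # the same call with named arguments in a different order
by.position == by.name         # TRUE: both give the logarithm of 8 to base 2
```

Naming arguments is a little more typing, but it protects you from accidentally supplying values in the wrong order.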

Note that when typing normal text (as in the name of a package), it needs to be surrounded by quotation marks, to avoid confusion with the names of objects. In other words, in R

```
citation
```

refers to a function, whereas

```
"citation"
```

is a "string" of text. This is useful, for example when providing titles for plots, etc.
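
One way to see the difference for yourself is to ask R what kind of object each one is. The sketch below uses the `class()` and `nchar()` functions, which are not otherwise discussed in this section, so treat it as an illustrative aside rather than part of the main thread.

```r
class(citation)              # "function": the bare name refers to an object that R knows about
class("citation")            # "character": in quotes, it is just a piece of text
'citation' == "citation"     # TRUE: single or double quotes produce the same string
nchar("citation")            # 8: a string is data, so you can, for example, count its characters
```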

You will probably find that one of the trickiest aspects of getting to know R is knowing which function to use in a particular situation. Fortunately, R not only provides documentation for all its functions, but also ways of searching through the documentation, as well as other ways of getting help.
There are a number of ways to get help in R, and there is also a wide variety of online information. Most installations of R come with a reasonably detailed help file called "An Introduction to R", but this can be rather technical for first-time users of a statistics package. Almost all functions and other objects that are automatically provided in R have a help page which gives intricate details about how to use them. These help pages usually also contain examples, which can be particularly helpful for new users. However, if you don't know the name of what you are looking for, then finding help may not be so easy, although it is possible to search for keywords and concepts that are associated with objects.

Some versions of R give easy access to help files without having to type in commands (for example, versions which provide menu bars usually have a "help" menu, and the Macintosh interface also has a help box in the top right hand corner). However, this functionality can always be accessed by typing in the appropriate commands. You might like to type some or all of the following into an R session (no output is listed here because the result will depend on your R system).

```
help.start()            #A web-based set of help pages (try the link to "An Introduction to R")
help(sqrt)              #Show details of the "sqrt" and similar functions
?sqrt                   #A shortcut to do the same thing
example(sqrt)           #run the examples on the bottom of the help page for "sqrt"
help.search("maximum")  #gives a list of functions involving the word "maximum", but oddly, "max" is not in there!
### The next line is commented out to reduce internet load. To try it, remove the first # sign.
#RSiteSearch("maximum")  #search the R web site for anything to do with "maximum". Probably overkill here!
```
The `help.search()` command illustrates a problem you may come across when using the R help functions: the searching facility for help files is sometimes a bit hit-and-miss. If you can't find exactly what you are looking for, it is often useful to look at the "See also" section of any help files that sound vaguely similar or relevant. In this case, you would probably eventually find the `max()` function by looking at the "See also" section of the help file for `which.max()`. Not ideal!
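
When you can guess part of a function's name, a possible alternative is `apropos()`, which searches the names of available objects rather than the text of the help files. This function is not mentioned in the text above, so the following is offered only as a hedged sketch of one way around the problem.

```r
apropos("max")               # names of all available objects containing "max"
"max" %in% apropos("max")    # TRUE: max() itself is in that list
apropos("^max")              # a pattern: only names that start with "max"
```
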
One of the most fundamental objects in R is the vector, used to store multiple measurements of the same type (e.g. data variables). There are several different sorts of data that can be stored in a vector. Most common is the numeric vector, in which each element of the vector is simply a number. Other commonly used types of vector are character vectors (where each element is a piece of text) and logical vectors (where each element is either `TRUE` or `FALSE`). In this topic we will use some example vectors provided by the "datasets" package, containing data on States of the USA (see `?state`).

R is an inherently vector-based program; in fact the numbers we have been using in previous calculations are just treated as vectors with a single element. This means that most basic functions in R will behave sensibly when given a vector as an argument, as shown below.

###### Input:
```
state.area                #a NUMERIC vector giving the area of US states, in square miles
state.name                #a CHARACTER vector (note the quote marks) of state names
sq.km <- state.area*2.59  #Arithmetic works on numeric vectors, e.g. convert sq miles to sq km
sq.km                     #... the new vector has the calculation applied to each element in turn
sqrt(sq.km)               #Many mathematical functions also apply to each element in turn
range(state.area)         #But some functions return different length vectors (here, just the max & min).
length(state.area)        #and some, like this useful one, just return a single value.
```
###### Result:
```
> state.area                #a NUMERIC vector giving the area of US states, in square miles
 [1]  51609 589757 113909  53104 158693 104247   5009   2057  58560  58876   6450  83557  56400
[14]  36291  56290  82264  40395  48523  33215  10577   8257  58216  84068  47716  69686 147138
[27]  77227 110540   9304   7836 121666  49576  52586  70665  41222  69919  96981  45333   1214
[40]  31055  77047  42244 267339  84916   9609  40815  68192  24181  56154  97914
> state.name                #a CHARACTER vector (note the quote marks) of state names
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia"
[49] "Wisconsin"      "Wyoming"
> sq.km <- state.area*2.59  #Arithmetic works on numeric vectors, e.g. convert sq miles to sq km
> sq.km                     #... the new vector has the calculation applied to each element in turn
 [1]  133667.31 1527470.63  295024.31  137539.36  411014.87  269999.73   12973.31    5327.63
 [9]  151670.40  152488.84   16705.50  216412.63  146076.00   93993.69  145791.10  213063.76
[17]  104623.05  125674.57   86026.85   27394.43   21385.63  150779.44  217736.12  123584.44
[25]  180486.74  381087.42  200017.93  286298.60   24097.36   20295.24  315114.94  128401.84
[33]  136197.74  183022.35  106764.98  181090.21  251180.79  117412.47    3144.26   80432.45
[41]  199551.73  109411.96  692408.01  219932.44   24887.31  105710.85  176617.28   62628.79
[49]  145438.86  253597.26
> sqrt(sq.km)               #Many mathematical functions also apply to each element in turn
 [1]  365.60540 1235.90883  543.16140  370.86299  641.10441  519.61498  113.90044   72.99062
 [9]  389.44884  390.49819  129.24976  465.20171  382.19890  306.58390  381.82601  461.58830
[17]  323.45487  354.50609  293.30334  165.51263  146.23826  388.30328  466.62203  351.54579
[25]  424.83731  617.32278  447.23364  535.06878  155.23324  142.46136  561.35100  358.33202
[33]  369.04978  427.81111  326.74911  425.54695  501.17940  342.65503   56.07370  283.60615
[41]  446.71213  330.77479  832.11058  468.96955  157.75712  325.13205  420.25859  250.25745
[49]  381.36447  503.58441
> range(state.area)         #But some functions return different length vectors (here, just the max & min).
[1]   1214 589757
> length(state.area)        #and some, like this useful one, just return a single value.
[1] 50
```
Note that the first part of your output may look slightly different to that above. Depending on the width of your screen, the number of elements printed on each line of output may differ. This is the reason for the numbers in square brackets, which are produced when vectors are printed to the screen. These bracketed numbers give the position of the first element on that line, which is a useful visual aid. For instance, looking at the printout of state.name, and counting across from the second line, we can tell that the eighth state is Delaware.
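
Rather than counting across the printout, you can also ask R for an element by its position. The sketch below uses square-bracket indexing, which has not been introduced in this section, so it is offered only as an illustrative assumption about how you might check the position described above.

```r
state.name[8]         # the eighth element: "Delaware"
state.name[5:8]       # elements 5 to 8: "California" "Colorado" "Connecticut" "Delaware"
length(state.name)    # 50, the same length as state.area
```
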
You may occasionally need to create your own vectors from scratch (although most vectors are obtained from processing data in already-existing files). The most commonly used function for constructing vectors is `c()`, so named because it concatenates objects together. However, if you wish to create vectors consisting of regular sequences of numbers (e.g. 2,4,6,8,10,12, or 1,1,2,2,1,1,2,2) there are several alternative functions you can use, including `seq()`, `rep()`, and the `:` operator.
###### Input:
```
c("one", "two", "three", "pi")  #Make a character vector
c(1,2,3,pi)                     #Make a numeric vector
seq(1,3)                        #Create a sequence of numbers
1:3                             #A shortcut for the same thing (but less flexible)
i <- 1:3                        #You can store a vector
i
i <- c(i,pi)                    #To add more elements, you must assign again, e.g. using c()
i
i <- c(i, "text")               #A vector cannot contain different data types, so ...
i                               #... R converts all elements to the same type
i+1                             #The numbers are now strings of text: arithmetic is impossible
rep(1, 10)                      #The "rep" function repeats its first argument
rep(3:1,10)                     #The first argument can also be a vector
huge.vector <- 0:(10^7)         #R can easily cope with very big vectors
#huge.vector #VERY BAD IDEA TO UNCOMMENT THIS, unless you want to print out 10 million numbers
rm(huge.vector)                 #"rm" removes objects. Deleting huge unused objects is sensible
```
###### Result:
```
> c("one", "two", "three", "pi")  #Make a character vector
[1] "one"   "two"   "three" "pi"
> c(1,2,3,pi)                     #Make a numeric vector
[1] 1.000000 2.000000 3.000000 3.141593
> seq(1,3)                        #Create a sequence of numbers
[1] 1 2 3
> 1:3                             #A shortcut for the same thing (but less flexible)
[1] 1 2 3
> i <- 1:3                        #You can store a vector
> i
[1] 1 2 3
> i <- c(i,pi)                    #To add more elements, you must assign again, e.g. using c()
> i
[1] 1.000000 2.000000 3.000000 3.141593
> i <- c(i, "text")               #A vector cannot contain different data types, so ...
> i                               #... R converts all elements to the same type
[1] "1"                "2"                "3"                "3.14159265358979" "text"
> i+1                             #The numbers are now strings of text: arithmetic is impossible
Error in i + 1 : non-numeric argument to binary operator
> rep(1, 10)                      #The "rep" function repeats its first argument
[1] 1 1 1 1 1 1 1 1 1 1
> rep(3:1,10)                     #The first argument can also be a vector
 [1] 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1
> huge.vector <- 0:(10^7)         #R can easily cope with very big vectors
> #huge.vector #VERY BAD IDEA TO UNCOMMENT THIS, unless you want to print out 10 million numbers
> rm(huge.vector)                 #"rm" removes objects. Deleting huge unused objects is sensible
```
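
The two regular sequences mentioned earlier (2,4,6,8,10,12 and 1,1,2,2,1,1,2,2) can be produced with the `by`, `each`, and `times` arguments of `seq()` and `rep()`. The following is a minimal sketch; the `length.out` line is an extra illustration not drawn from the text.

```r
seq(2, 12, by=2)             # 2 4 6 8 10 12: step from 2 to 12 in increments of 2
seq(0, 1, length.out=5)      # 0.00 0.25 0.50 0.75 1.00: five evenly spaced values
rep(1:2, each=2)             # 1 1 2 2: repeat each element twice
rep(1:2, each=2, times=2)    # 1 1 2 2 1 1 2 2: then repeat the whole pattern twice
```
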
Categorical variables in R are stored as a special vector object known as a factor. This is not the same as a character vector filled with a set of names (don't get the two mixed up). In particular, R has to be told that each element can only be one of a number of known levels (e.g. Male or Female). If you try to place a data point with a different, unknown level into the factor, R will complain. When you print a factor to the screen, R will also list the possible levels that factor can take (this may include ones that aren't present).

The `factor()` function creates a factor and defines the available levels. By default the levels are taken from the distinct values present in the vector. In practice you may not often need to call `factor()` yourself: in versions of R before 4.0.0, text read in from a file was converted to factors by default, whereas more recent versions leave it as character vectors unless told otherwise (see Statistical Analysis: an Introduction using R/R/R/Data frames). To convert an existing vector into a factor you can also use `as.factor()`. Internally, R stores the levels as numbers from 1 upwards, but it is not always obvious which number corresponds to which level, and it should not normally be necessary to know.

Ordinal variables, that is factors in which the levels have a natural order, are known to R as ordered factors. They can be created in the normal way a factor is created, but in addition specifying `ordered=TRUE`.
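
As a hedged sketch of the difference between a plain factor and an ordered one, consider the made-up data below. The names `sizes`, `size.factor`, and `size.ordered` are hypothetical and exist only for this illustration.

```r
# Hypothetical example data: all object names here are made up
sizes <- c("small", "large", "medium", "small", "large")
size.factor <- factor(sizes)     # levels default to the sorted distinct values
levels(size.factor)              # "large" "medium" "small"
size.ordered <- factor(sizes, levels=c("small", "medium", "large"), ordered=TRUE)
size.ordered                     # the levels are printed as small < medium < large
size.ordered < "large"           # comparisons are meaningful for ordered factors
table(size.ordered)              # counts per level, listed in level order
```

Giving the `levels` argument explicitly is what fixes the order; without it, R would fall back on alphabetical order, which is rarely the natural one.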

###### Input:
```
state.region                 #An example of a factor: note that the levels are printed out
state.name                   #this is *NOT* a factor
my.names <- state.name       #work on copies, so that the built-in data are left unchanged
my.regions <- state.region
my.names[1] <- "Any text"    #you can replace text in a character vector ([1] picks the first element)
my.regions[1] <- "Any text"  #but you can't in a factor: R warns and stores NA, as this is not a known level
my.regions[1] <- "South"     #this is OK, because "South" is one of the levels
state.abb                    #this is not a factor, just a character vector
character.vector <- c("Female", "Female", "Male", "Male", "Male", "Female", "Female", "Male", "Male", "Male", "Male", "Male", "Female", "Female", "Male", "Female", "Female", "Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female", "Male", "Male", "Male", "Female", "Male", "Female", "Male", "Male", "Male", "Male", "Male", "Female", "Male", "Male", "Male", "Male", "Female", "Female", "Female") #a bit tedious to do all that typing
#might be easier to use codes, e.g. 1 for female and 2 for male
Coded <- factor(c(1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1))
Gender <- factor(Coded, labels=c("Female", "Male")) #we can then convert this to named levels
```
###### Result:
When collecting data, it is often the case that certain data points are unknown. This happens for a variety of reasons. For example, when analysing experimental data, we might be recording a number of variables for each experiment (e.g. temperature, time of day, etc.), yet have forgotten (or been unable) to record temperature in one instance. Or when collecting social data on US states, it might be that certain states do not record certain statistics of interest. Another example is the ship passenger data from the sinking of the Titanic, where careful research has identified the ticket class of all 2207 people on board, but not been able to ascertain the age of 10 or so of the victims (see http://www.encyclopedia-titanica.org).

We could just omit missing data, but in many cases we have information for some variables and not for others. For example, we might not want to completely omit a US state from an analysis just because it is missing one particular datum of interest. For this reason, R provides a special value, `NA`, meaning "not available". Any vector, whether numeric, character, or logical, can have elements which are `NA`. These can be identified by the function `is.na()`.

###### Input:
```
some.missing <- c(1,NA)
is.na(some.missing)
```
###### Result:
```
> some.missing <- c(1,NA)
> is.na(some.missing)
[1] FALSE  TRUE
```
Note that some analyses are hard to do if there are missing data. You can use "complete.cases" or "na.omit" to construct datasets with the missing values omitted.
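
The following minimal sketch shows how a single missing value affects a simple calculation, and how the functions just mentioned behave on a plain vector. The vector `temps` and its values are invented for illustration, and the `na.rm` argument of `mean()` is an addition not discussed in the text above.

```r
# Hypothetical measurements: "temps" is a made-up vector with one missing value
temps <- c(21.5, NA, 19.0, 22.5)
mean(temps)                # NA: the mean is unknown if any value is unknown
mean(temps, na.rm=TRUE)    # 21: many functions can be told to drop NAs first
sum(is.na(temps))          # 1: counts the missing values
na.omit(temps)             # the vector with the missing value removed
complete.cases(temps)      # TRUE FALSE TRUE TRUE: which elements are not missing
```
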
R is very particular about what can be contained in a vector. All the elements need to be of the same type, and moreover must be either types of number, logical values, or strings of text.

If you want a collection of elements which are of different types, or not of one of the allowed vector types, you need to use a list.
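
As a rough sketch of what a list can hold and how its elements might be retrieved, consider the example below. The name `my.list` and its contents are hypothetical, and the `$` and `[[ ]]` extraction operators are not otherwise covered in this section.

```r
# Hypothetical example: "my.list" and its contents are made up for illustration
my.list <- list(name="Texas", area=267339, fn=sqrt)   # text, a number, and a function
my.list$name           # extract an element by name: "Texas"
my.list[["area"]]      # double square brackets do the same job: 267339
my.list$fn(16)         # the stored function can be called: 4
length(my.list)        # 3: the number of elements in the list
```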

###### Input:
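```
l1 <- list(a=1, b=1:3)  #a list whose named elements are vectors of different lengths
l2 <- c(sqrt, log)      #functions cannot be stored in a vector, so c() returns a list here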
```
###### Result:
Producing rough plots in R is extremely easy, although it can be time consuming tweaking them to get a certain look. The defaults are usually sensible.
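
As one hedged illustration of tweaking the defaults, the sketch below adds a title and an axis label to a histogram using text strings. The `pdf(NULL)` and `dev.off()` lines are an assumption made so that the example runs without opening a window or writing a file; leave them out to see the plot on screen. The name `h` and the label text are made up for this sketch.

```r
pdf(NULL)    # assumption: draw to a null device so that no window or file is needed
h <- hist(sqrt(state.area),
          main="US state areas",                # plot title, given as a text string
          xlab="Square root of area (miles)")   # label for the horizontal axis
dev.off()    # close the null device again
sum(h$counts)    # 50: every state falls into exactly one bar
```
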
###### Input:
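```
stripchart(state.area, xlab="Area (sq. miles)") #see method="stack" & method="jitter" for others
boxplot(sqrt(state.area))        #a box-and-whisker plot of the square-rooted areas
hist(sqrt(state.area))           #a histogram with the default number of bars
hist(sqrt(state.area), 25)       #the second argument suggests the number of breaks (here about 25)
plot(density(sqrt(state.area)))  #a smoothed density estimate
plot(UKDriverDeaths)             #a time series from the "datasets" package
qqnorm(sqrt(state.area))         #normal quantile-quantile plot (argument assumed: the original call was left empty)
plot(ecdf(sqrt(state.area)))     #empirical cumulative distribution (argument assumed, as above)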
```
###### Result:

Statistical Analysis: an Introduction using R/R/Bivariate plots

1. Depending on how you are viewing this book, you may see a ">" character in front of each command. This is not part of the command to type: it is produced by R itself to prompt you to type something. This character should be automatically omitted if you are copying and pasting from the online version of this book, but if you are reading the paper or pdf version, you should omit the ">" prompt when typing into R.
2. If you are familiar with computer programming languages, you may be used to using the underscore ("_") character in names. In R, "." is usually used in its place.
3. You can use either single (') or double (") quotes to delimit text strings, as long as the start and end quotes match.
4. These are special words in R, and cannot be used as names for objects. The objects `T` and `F` are temporary shortcuts for `TRUE` and `FALSE`, but if you use them, watch out: since T and F are just normal object names you can change their meaning by overwriting them.
5. There are actually 3 types of allowed numbers: "normal" numbers, complex numbers, and simple integers. This book deals almost exclusively with the first of these.
6. This is not quite true, but unless you are a computer specialist, you are unlikely to use the final type: a vector of elements storing "raw" computer bits (see `?raw`).