Statistical Analysis: an Introduction using R/R basics
Why R?
R is a command-driven statistical package. At first sight, this can make it rather daunting to use. However, there are a number of reasons to learn statistics using this computer program. The two most important are:

 R is free; you can download it from http://www.r-project.org and install it onto just about any sort of computer you like.
 R allows you to do all the statistical tests you are likely to need, from simple to highly advanced ones. This means that you should always be able to perform the right analysis on your data.
An additional bonus is that R has excellent graphics and programming capabilities, so it can be used as an aid to teaching and learning. For example, all the illustrations in this book have been produced using R; by clicking on any illustration, you can obtain the R commands used to produce it.
A final benefit, which is of more use once you have some basic knowledge of either statistics or R, is that there are many online resources to help users of R. A list is available in the appendix to this book.
How to use this book with R
The main text in this book describes the why and how of statistics, which is relevant whatever statistical package you use. However, alongside the main text, there are a large number of "R topics": exercises and examples that use R to illustrate particular points. You may find that it takes some time to get used to R, especially if you are unfamiliar with the idea of computer languages.
Don't worry! The topics in this chapter and in Chapter 2 should get you going, to the point where you can understand and use R's basic functionality. This chapter is intended to get you started: once you have installed R, there are topics on how to carry out simple calculations and use functions, how to store results, how to get help, and how to quit. The few exercises in Chapter 1 mainly show the possibilities open to you when using R, then Chapter 2 introduces the nuts and bolts of R usage: in particular vectors and factors, reading data into data frames, and plotting of various sorts. From then on, the exercises become more statistical in nature.
If you wish to work straight through these initial exercises before statistical discussion, they are collected here. Note that when working through R topics online, you may find it more visually appealing if you set up wikibooks to display R commands nicely. If the R topics get in the way of reading the main text, they can be hidden by clicking on the arrow at the top right of each box.
Starting R
If you don't already have R installed on your computer, download the latest version for free from http://www.r-project.org, and install the base system. You don't need to install any extra packages yet. Once you have installed it, start it up, and you should be presented with something like this:
R version 2.11.1 (2010-05-31)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
You are now in an R session. R is a command-driven program, and the ominous-looking ">" character means that R is now waiting for you to type something. Don't be daunted. You will soon get the hang of the simplest commands, and that is all you should need for the moment. And you will eventually find that the command-line driven interface gives you a degree of freedom and power^{[1]} that is impossible to achieve using more "user-friendly" packages.
R as a calculator
Input:
100+2/3
Result:
> 100+2/3
[1] 100.6667
Input:
#this is a comment: R will ignore it
(100+2)/3 #You can use round brackets to group operations so that they are carried out first
5*10^2 #The symbol * means multiply, and ^ means "to the power", so this gives 5 times (10 squared), i.e. 500
1/0 #R knows about infinity (and minus infinity)
0/0 #undefined results take the value NaN ("not a number")
(0i-9)^(1/2) #for the mathematically inclined, you can force R to use complex numbers
Result:
> #this is a comment: R will ignore it
> (100+2)/3 #You can use round brackets to group operations so that they are carried out first
[1] 34
> 5*10^2 #The symbol * means multiply, and ^ means "to the power", so this gives 5 times (10 squared), i.e. 500
[1] 500
> 1/0 #R knows about infinity (and minus infinity)
[1] Inf
> 0/0 #undefined results take the value NaN ("not a number")
[1] NaN
> (0i-9)^(1/2) #for the mathematically inclined, you can force R to use complex numbers
[1] 0+3i
 If you don't know anything about complex numbers, don't worry: they are not important here.
 Note that you can't use curly brackets {} or square brackets [] to group operations together
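A few more lines to try at the prompt, offered as a small sketch rather than part of the original exercise. They show how round brackets change the order in which operations are carried out; the operators %/% (whole-number division) and %% (remainder) are not used elsewhere in this chapter and are included only for illustration.

```r
2 + 3 * 4      # multiplication is carried out before addition, giving 14
(2 + 3) * 4    # round brackets force the addition to happen first, giving 20
-2^2           # "to the power" is carried out before the minus sign, giving -4
(-2)^2         # brackets make the number being squared negative, giving 4
7 %/% 2        # whole-number division, giving 3
7 %% 2         # the remainder from that division, giving 1
-1 / 0         # minus infinity, printed as -Inf
```

If you are ever unsure about the order in which R will carry out a calculation, adding round brackets is the safest option.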
Storing objects
Objects are stored under a name using <- or ->, as demonstrated in the exercise below. Which sign you use depends on whether you prefer putting the name first or last (it may be helpful to think of -> as "put into" and <- as "set to").
Unlike many statistical packages, R does not usually display the results of analyses you perform. Instead, analyses usually end up by producing an object which can be stored. Results can then be obtained from the object at leisure. For this reason, when doing statistics in R, you will often find yourself naming and storing objects. The name you choose should consist of letters, numbers, and the "." character^{[3]}, and should not start with a number.
Input:
0.001 -> small.num #Store the number 0.001 under the name "small.num" (i.e. put 0.001 into small.num)
big.num <- 10 * 100 #You can put the name first if you reverse the arrow (set big.num to 1000).
big.num+small.num+1 #Now you can treat big.num and small.num as numbers, and use them in calculations
my.result <- big.num+small.num+2 #And you can store the result of any calculation
my.result #To look at the stored object, just type its name
pi #There are some named objects that R provides for you
Result:
> 0.001 -> small.num #Store the number 0.001 under the name "small.num" (i.e. put 0.001 into small.num)
> big.num <- 10 * 100 #You can put the name first if you reverse the arrow (set big.num to 1000).
> big.num+small.num+1 #Now you can treat big.num and small.num as numbers, and use them in calculations
[1] 1001.001
> my.result <- big.num+small.num+2 #And you can store the result of any calculation
> my.result #To look at the stored object, just type its name
[1] 1002.001
> pi #There are some named objects that R provides for you
[1] 3.141593
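One point that the exercise above does not make explicit is that a stored object holds a value, not the calculation that produced it. The following is a minimal sketch of this; the names radius and area are invented for the illustration, and ls() is a standard R function that is not otherwise covered in this chapter.

```r
radius <- 2              # set radius to 2
pi * radius^2 -> area    # put the result of the calculation into area
area                     # prints 12.56637

radius <- 3              # storing under an existing name overwrites the old value
area                     # still 12.56637: area holds a number, not the formula

ls()                     # lists the names of the objects stored so far
```

To update area after changing radius, you would need to repeat the calculation and store the result again.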
Functions
Input:
citation()
Result:
> citation()

To cite R in publications use:

  R Development Core Team (2008). R: A language and environment for
  statistical computing. R Foundation for Statistical Computing,
  Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

A BibTeX entry for LaTeX users is

  @Manual{,
    url = {http://www.R-project.org},
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Development Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2008},
    note = {{ISBN} 3-900051-07-0},
  }

We have invested a lot of time and effort in creating R, please cite it
when using it for data analysis. See also ‘citation("pkgname")’ for
citing R packages.
The example above calls the citation() function. It can take an optional argument giving the name of an R add-on package. If you do not provide an optional argument, there is usually an assumed default value (in the case of citation(), this default value is "base", i.e. provide the citation reference for the base package: the package which provides most of the foundations of the R language).
Most arguments to a function are named. For example, the first argument of the citation function is named package. To provide extra clarity, when using a function you can provide arguments in the longer form name=value. Thus
citation("base")
does the same as
citation(package="base")
If a function can take more than one argument, using the long form also allows you to change the order of arguments, as shown in the example code below.
Input:
citation("base") #Does the same as citation(), because the default for the first argument is "base"
#Note: quotation marks are needed in this particular case (see discussion below)
citation("datasets") #Find the citation for another package (in this case, the result is very similar)
sqrt(25) #A different function: "sqrt" takes a single argument, returning its square root.
sqrt(25-9) #An argument can contain arithmetic and so forth
sqrt(25-9)+100 #The result of a function can be used as part of a further analysis
max(-10, 0.2, 4.5) #This function returns the maximum value of all its arguments
sqrt(2 * max(-10, 0.2, 4.5)) #You can use results of functions as arguments to other functions
x <- sqrt(2 * max(-10, 0.2, 4.5)) + 100 #... and you can store the results of any of these calculations
x
log(100) #This function returns the logarithm of its first argument
log(2.718282) #By default this is the natural logarithm (base "e")
log(100, base=10) #But you can change the base of the logarithm using the "base" argument
log(100, 10) #This does the same, because "base" is the second argument of the log function
log(base=10, 100) #To have the base as the first argument, you have to use the form name=value
Result:
> citation("base") #Does the same as citation(), because the default for the first argument is "base"

To cite R in publications use:

  R Development Core Team (2008). R: A language and environment for
  statistical computing. R Foundation for Statistical Computing,
  Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Development Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2008},
    note = {{ISBN} 3-900051-07-0},
    url = {http://www.R-project.org},
  }

We have invested a lot of time and effort in creating R, please cite it
when using it for data analysis. See also ‘citation("pkgname")’ for
citing R packages.

> #Note: quotation marks are needed in this particular case (see discussion below)
> citation("datasets") #Find the citation for another package (in this case, the result is very similar)

The 'datasets' package is part of R. To cite R in publications use:

  R Development Core Team (2008). R: A language and environment for
  statistical computing. R Foundation for Statistical Computing,
  Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Development Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2008},
    note = {{ISBN} 3-900051-07-0},
    url = {http://www.R-project.org},
  }

We have invested a lot of time and effort in creating R, please cite it
when using it for data analysis. See also ‘citation("pkgname")’ for
citing R packages.

> sqrt(25) #A different function: "sqrt" takes a single argument, returning its square root.
[1] 5
> sqrt(25-9) #An argument can contain arithmetic and so forth
[1] 4
> sqrt(25-9)+100 #The result of a function can be used as part of a further analysis
[1] 104
> max(-10, 0.2, 4.5) #This function returns the maximum value of all its arguments
[1] 4.5
> sqrt(2 * max(-10, 0.2, 4.5)) #You can use results of functions as arguments to other functions
[1] 3
> x <- sqrt(2 * max(-10, 0.2, 4.5)) + 100 #... and you can store the results of any of these calculations
> x
[1] 103
> log(100) #This function returns the logarithm of its first argument
[1] 4.60517
> log(2.718282) #By default this is the natural logarithm (base "e")
[1] 1
> log(100, base=10) #But you can change the base of the logarithm using the "base" argument
[1] 2
> log(100, 10) #This does the same, because "base" is the second argument of the log function
[1] 2
> log(base=10, 100) #To have the base as the first argument, you have to use the form name=value
[1] 2
Note that when typing normal text (as in the name of a package), it needs to be surrounded by quotation marks^{[4]}, to avoid confusion with the names of objects. In other words, in R, citation refers to a function, whereas "citation" is a "string" of text. This is useful, for example when providing titles for plots, etc.
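The two ideas above (named arguments, and the difference between a name and a quoted string) can be tried out with a short sketch. The functions seq() and class() are standard R functions, but they are not introduced in this chapter and are used here purely as an illustration.

```r
# Named arguments can be given in any order
seq(from = 1, to = 10, by = 3)    # the numbers 1, 4, 7 and 10
seq(by = 3, to = 10, from = 1)    # exactly the same result

# An unquoted name refers to an object; a quoted name is just text
class(sqrt)      # "function"
class("sqrt")    # "character", which is R's term for a string of text
```

Asking for the class of an object is a quick way to check whether R is treating something as a function, a number, or a piece of text.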
You will probably find that one of the trickiest aspects of getting to know R is knowing which function to use in a particular situation. Fortunately, R not only provides documentation for all its functions, but also ways of searching through the documentation, as well as other ways of getting help.
Getting help
Some versions of R give easy access to help files without having to type in commands (for example, versions which provide menu bars usually have a "help" menu, and the Macintosh interface also has a help box in the top right-hand corner). However, this functionality can always be accessed by typing in the appropriate commands. You might like to type some or all of the following into an R session (no output is listed here because the result will depend on your R system).
help.start() #A web-based set of help pages (try the link to "An Introduction to R")
help(sqrt) #Show details of the "sqrt" and similar functions
?sqrt #A shortcut to do the same thing
example(sqrt) #run the examples on the bottom of the help page for "sqrt"
help.search("maximum") #gives a list of functions involving the word "maximum", but oddly, "max" is not in there!
### The next line is commented out to reduce internet load. To try it, remove the first # sign.
#RSiteSearch("maximum") #search the R web site for anything to do with "maximum". Probably overkill here!
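Two further commands may be worth trying alongside those above. This is only a sketch: apropos() and args() are standard R functions that this chapter does not otherwise cover, and the name matches is invented for the illustration. The first searches the names of available objects rather than the text of the help pages; the second gives a quick reminder of a function's arguments.

```r
# List every available object whose name contains "max"
matches <- apropos("max")
matches
"max" %in% matches    # TRUE: searching by name does find the max function

# Show the argument names and default values of the log function
args(log)
```

The exact list returned by apropos() depends on which packages are loaded in your session.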
Because help.search("maximum") does not list "max", one way to find the max() function is by looking at the "See also" section of the help file for which.max(). Not ideal!
Quitting R
To quit R, use quit() or its identical shortcut, q(), which do not require any arguments. Alternatively, if your version of R has a menu bar, you can select "quit" or "exit" with the mouse.
q()
Setting up wikibooks
Before you start on the main text, we recommend that you add a few specific wikibooks preferences. The first three lines will display the examples of R commands in a nicer format. The last line gives a nicer format to figures consisting of multiple plots (known as subfigures). You can do this by creating a user CSS file, as follows.
 Make sure you are logged in (and create yourself an account if you do not have one already).
 Visit your personal CSS stylesheet, at Special:MyPage/skin.css.
 Click on "Edit this page".
 Paste the following lines into the large edit box:
pre {padding:0; border: none; margin:0; line-height: 1.5em; }
.code .input ol {list-style: none; font-size: 1.2em; margin-left: 0;}
.code .input ol li div:before {content: "\003E \0020";}
table.subfigures div.thumbinner, table.subfigures tr td, table.subfigures {border: 0;}
^{[5]}
 If you know any CSS, make any alterations you like to this stylesheet.
 Finally, save the page by clicking on "Save page".
Enough! Let's move on to the main text.
Notes
 ↑ These are poor attempts at statistical jokes, as you will soon find out.
 ↑ Depending on how you are viewing this book, you may see a ">" character in front of each command. This is not part of the command to type: it is produced by R itself to prompt you to type something. This character should be automatically omitted if you are copying and pasting from the online version of this book, but if you are reading the paper or pdf version, you should omit the ">" prompt when typing into R.
 ↑ If you are familiar with computer programming languages, you may be used to using the underscore ("_") character in names. In R, "." is usually used in its place.
 ↑ You can use either single (') or double (") quotes to delimit text strings, as long as the start and end quotes match.
 ↑ Note that this is a temporary hack until GeSHi supports R code, at which point Statistical Analysis: an Introduction using R/R/Syntax can be changed. The CSS code should really read:
.pre {padding:0; border: none; margin:0; line-height: 1.5em; }
.sourceR ol {list-style: none; font-size: 1.2em; margin-left: 0;}
.source_R ol li div:before {content: "\003E \0020";}