Statistical Analysis: an Introduction using R/R basics

Why R?

R is a command-driven statistical package. At first sight, this can make it rather daunting to use. However, there are a number of reasons to learn statistics using this computer program. The two most important are:

  • R is free; you can download it from http://www.r-project.org and install it onto just about any sort of computer you like.
  • R allows you to do all the statistical tests you are likely to need, from simple to highly advanced ones. This means that you should always be able to perform the right analysis on your data.

An additional bonus is that R has excellent graphics and programming capabilities, so it can be used as an aid to teaching and learning. For example, all the illustrations in this book have been produced using R; by clicking on any illustration, you can obtain the R commands used to produce it.

A final benefit, which is of more use once you have some basic knowledge of either statistics or R, is that there are many online resources to help users of R. A list is available in the appendix to this book.

How to use this book with R

The main text in this book describes the why and how of statistics, which is relevant whatever statistical package you use. However, alongside the main text, there are a large number of "R topics": exercises and examples that use R to illustrate particular points. You may find that it takes some time to get used to R, especially if you are unfamiliar with the idea of computer languages.

Don't worry! The topics in this chapter and in Chapter 2 should get you going, to the point where you can understand and use R's basic functionality. This chapter is intended to get you started: once you have installed R, there are topics on how to carry out simple calculations and use functions, how to store results, how to get help, and how to quit. The few exercises in Chapter 1 mainly show the possibilities open to you when using R, then Chapter 2 introduces the nuts and bolts of R usage: in particular vectors and factors, reading data into data frames, and plotting of various sorts. From then on, the exercises become more statistical in nature.

If you wish to work straight through these initial exercises before statistical discussion, they are collected here. Note that when working through R topics online, you may find it more visually appealing if you set up wikibooks to display R commands nicely. If the R topics get in the way of reading the main text, they can be hidden by clicking on the arrow at the top right of each box.

Starting R

If you don't already have R installed on your computer, download the latest version for free from http://www.r-project.org, and install the base system. You don't need to install any extra packages yet. Once you have installed it, start it up, and you should be presented with something like this:

R version 2.11.1 (2010-05-31)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

You are now in an R session. R is a command-driven program, and the ominous-looking ">" character means that R is now waiting for you to type something. Don't be daunted. You will soon get the hang of the simplest commands, and that is all you should need for the moment. And you will eventually find that the command-line driven interface gives you a degree of freedom and power[1] that is impossible to achieve using more "user-friendly" packages.

R as a calculator

Text marked like this is used to discuss an R-specific point. The basics of R can be learned by reading these sections in the order they appear in the book. There will also be commands that can be entered directly into R; you should be able to copy-and-paste them directly into your R session[2]. Try the following to see how to use R as a simple calculator:
Input:
  100+2/3

Result:
> 100+2/3
[1] 100.6667

In the absence of any instructions about what to do with the output of a command, R usually prints the result to the screen. For the time being, ignore the [1] before the answer: we will see that this is useful when R outputs many numbers at once. Note that R respects the standard mathematical rules of carrying out multiplication and division before addition and subtraction: it divides 2 by 3 before adding 100.
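
To get a feel for what the [1] is for, it helps to print several numbers at once. The line below is only a sketch: it uses the colon operator, which has not been introduced in this chapter and is assumed here purely for illustration (1:30 produces the whole numbers from 1 to 30).

```r
1:30    #Print the whole numbers from 1 to 30 (the colon operator is an assumption here, not covered above)
```

If the numbers wrap onto more than one line, each line starts with a bracketed number giving the position of the first value on that line. With a single result, that position is always 1.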
R commands can sometimes be rather difficult to follow, so occasionally it can be useful to annotate them with comments. This can be done by typing a hash (#) character: any further text on the same line is ignored by R. This will be used extensively in the R examples in this wikibook, e.g.
Input:
  #this is a comment: R will ignore it
  (100+2)/3    #You can use round brackets to group operations so that they are carried out first
  5*10^2       #The symbol * means multiply, and ^ means "to the power", so this gives 5 times (10 squared), i.e. 500
  1/0          #R knows about infinity (and minus infinity)
  0/0          #undefined results take the value NaN ("not a number")
  (0i-9)^(1/2) #for the mathematically inclined, you can force R to use complex numbers

Result:
> #this is a comment: R will ignore it
> (100+2)/3 #You can use round brackets to group operations so that they are carried out first
[1] 34
> 5*10^2 #The symbol * means multiply, and ^ means "to the power", so this is 5 times (10 squared)
[1] 500
> 1/0 #R knows about infinity (and minus infinity)
[1] Inf
> 0/0 #undefined results take the value NaN ("not a number")
[1] NaN
> (0i-9)^(1/2) #for the mathematically inclined, you can force R to use complex numbers
[1] 0+3i

  • If you don't know anything about complex numbers, don't worry: they are not important here.
  • Note that you should use only round brackets to group operations together: curly brackets {} and square brackets [] have other meanings in R.
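
As a further sketch of the same rules, the lines below show minus infinity and a case where round brackets change the answer. These particular expressions are illustrative choices rather than part of the examples above.

```r
-1/0          #Minus infinity
-9^(1/2)      #"To the power" is carried out before the minus sign is applied, so this is -(9^(1/2))
(-9)^(1/2)    #With brackets, this asks for the square root of -9, which is undefined for ordinary numbers
```

The first line gives -Inf, the second gives -3, and the third gives NaN. This is why the earlier example starts from 0i-9 in order to obtain a complex answer.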


Storing objects

R is what is known as an "object-oriented" program. Everything (including the numbers you have just typed) is a type of object. Later we will see why this concept is so useful. For the time being, you need only note that you can give a name to an object, which has the effect of storing it for later use. Names can be assigned by using the arrow-like signs <- and -> as demonstrated in the exercise below. Which sign you use depends on whether you prefer putting the name first or last (it may be helpful to think of -> as "put into" and <- as "set to").

Unlike many statistical packages, R does not usually display the results of analyses you perform. Instead, analyses usually end up by producing an object which can be stored. Results can then be obtained from the object at leisure. For this reason, when doing statistics in R, you will often find yourself naming and storing objects. The name you choose should consist of letters, numbers, and the "." character[3], and should not start with a number.

Input:
  1. 0.001 -> small.num                #Store the number 0.001 under the name "small.num" (i.e. put 0.001 into small.num)
  2. big.num <- 10 * 100               #You can put the name first if you reverse the arrow (set big.num to 1000).
  3. big.num+small.num+1               #Now you can treat big.num and small.num as numbers, and use them in calculations
  4. my.result <- big.num+small.num+2  #And you can store the result of any calculation
  5. my.result                         #To look at the stored object, just type its name
  6. pi                                #There are some named objects that R provides for you

Result:
> 0.001 -> small.num                #Store the number 0.001 under the name "small.num" (i.e. put 0.001 into small.num)
> big.num <- 10 * 100               #You can put the name first if you reverse the arrow (set big.num to 1000).
> big.num+small.num+1               #Now you can treat big.num and small.num as numbers, and use them in calculations
[1] 1001.001
> my.result <- big.num+small.num+2  #And you can store the result of any calculation
> my.result                         #To look at the stored object, just type its name
[1] 1002.001
> pi                                #There are some named objects that R provides for you
[1] 3.141593

Note that when the end result of a command is to store (assign) an object, as on input lines 1, 2, and 4, R doesn't print anything to the screen.
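
The short sketch below shows that the two arrows have the same effect, and one way of storing a result and printing it in a single step. The names a, b, and total.2 are made up for this illustration.

```r
5 -> a                #Put 5 into a
b <- 5                #Set b to 5: the same effect, with the name first
a - b                 #The two stored values are the same, so this gives 0
(total.2 <- a + b)    #Round brackets around an assignment store the result and print it as well
```

Note that total.2 is a valid name because the number does not come first; the last line prints 10.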


Functions

Apart from numbers, perhaps the most useful named objects in R are functions. Nearly everything useful that you will do in R is carried out using a function, and many are available in R by default. You can use (or "call") a function by typing its name followed by a pair of round brackets. For instance, the start-up text mentions the following function, which you might find useful if you want to reference R in published work:
Input:
  citation()

Result:
> citation()

To cite R in publications use:

  R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical
  Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

A BibTeX entry for LaTeX users is

  @Manual{,
    url = {http://www.R-project.org},
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Development Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2008},
    note = {{ISBN} 3-900051-07-0},
  }

We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also
‘citation("pkgname")’ for citing R packages.
Many R functions can produce results which differ depending on arguments that you provide to them. Arguments are placed inside the round brackets, separated by commas. Many functions have one or more optional arguments: that is, you can choose whether or not to provide them. An example of this is the citation() function. It can take an optional argument giving the name of an R add-on package. If you do not provide an optional argument, there is usually an assumed default value (in the case of citation(), this default value is "base", i.e. provide the citation reference for the base package: the package which provides most of the foundations of the R language).

Most arguments to a function are named. For example, the first argument of the citation function is named package. To provide extra clarity, when using a function you can provide arguments in the longer form name=value. Thus

citation("base")

does the same as

citation(package="base")

If a function can take more than one argument, using the long form also allows you to change the order of arguments, as shown in the example code below.

Input:
  citation("base")      #Does the same as citation(), because the default for the first argument is "base"
                        #Note: quotation marks are needed in this particular case (see discussion below)
  citation("datasets")  #Find the citation for another package (in this case, the result is very similar)
  sqrt(25)              #A different function: "sqrt" takes a single argument, returning its square root.
  sqrt(25-9)            #An argument can contain arithmetic and so forth
  sqrt(25-9)+100        #The result of a function can be used as part of a further analysis
  max(-10, 0.2, 4.5)    #This function returns the maximum value of all its arguments
  sqrt(2 * max(-10, 0.2, 4.5))             #You can use results of functions as arguments to other functions
  x <- sqrt(2 * max(-10, 0.2, 4.5)) + 100  #... and you can store the results of any of these calculations
  x
  log(100)              #This function returns the logarithm of its first argument
  log(2.718282)         #By default this is the natural logarithm (base "e")
  log(100, base=10)     #But you can change the base of the logarithm using the "base" argument
  log(100, 10)          #This does the same, because "base" is the second argument of the log function
  log(base=10, 100)     #To have the base as the first argument, you have to use the form name=value

    
Result:
> citation("base")      #Does the same as citation(), because the default for the first argument is "base"

To cite R in publications use:

  R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for
  Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Development Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2008},
    note = {{ISBN} 3-900051-07-0},
    url = {http://www.R-project.org},
  }

We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also
‘citation("pkgname")’ for citing R packages.

>                       #Note: quotation marks are needed in this particular case (see discussion below)
> citation("datasets")  #Find the citation for another package (in this case, the result is very similar)

The 'datasets' package is part of R.  To cite R in publications use:

  R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for
  Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Development Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2008},
    note = {{ISBN} 3-900051-07-0},
    url = {http://www.R-project.org},
  }

We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also
‘citation("pkgname")’ for citing R packages.

> sqrt(25)              #A different function: "sqrt" takes a single argument, returning its square root.
[1] 5
> sqrt(25-9)            #An argument can contain arithmetic and so forth
[1] 4
> sqrt(25-9)+100        #The result of a function can be used as part of a further analysis
[1] 104
> max(-10, 0.2, 4.5)    #This function returns the maximum value of all its arguments
[1] 4.5
> sqrt(2 * max(-10, 0.2, 4.5))             #You can use results of functions as arguments to other functions
[1] 3
> x <- sqrt(2 * max(-10, 0.2, 4.5)) + 100  #... and you can store the results of any of these calculations
> x
[1] 103
> log(100)              #This function returns the logarithm of its first argument
[1] 4.60517
> log(2.718282)         #By default this is the natural logarithm (base "e")
[1] 1
> log(100, base=10)     #But you can change the base of the logarithm using the "base" argument
[1] 2
> log(100, 10)          #This does the same, because "base" is the second argument of the log function
[1] 2
> log(base=10, 100)     #To have the base as the first argument, you have to use the form name=value
[1] 2
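
The same name=value matching works for other functions with named arguments. As a minimal sketch, the lines below use choose(), a function not discussed in this chapter and picked here only as an illustration: its two arguments are named n and k, and it returns the number of ways of choosing k items from n.

```r
choose(5, 2)        #Arguments matched by position: n is 5 and k is 2
choose(n=5, k=2)    #The same call in the longer form name=value
choose(k=2, n=5)    #Named arguments can be given in any order
choose(2, 5)        #Without names the order matters: this asks for 5 items from only 2
```

The first three lines all give 10, whereas the last gives 0.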

Note that when typing normal text (as in the name of a package), it needs to be surrounded by quotation marks[4], to avoid confusion with the names of objects. In other words, in R

citation

refers to a function, whereas

"citation"

is a "string" of text. This is useful, for example when providing titles for plots, etc.
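
The distinction can be checked directly in R. The sketch below uses is.function(), is.character(), and identical(), none of which are introduced in the text above; they are assumed here only as a convenient way of asking R what kind of object it has been given.

```r
is.function(sqrt)            #TRUE: without quotation marks, sqrt refers to the function
is.character("sqrt")         #TRUE: with quotation marks, "sqrt" is a string of text
is.function("sqrt")          #FALSE: the string is not itself a function
identical('sqrt', "sqrt")    #TRUE: single and double quotes produce the same string
```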

You will probably find that one of the trickiest aspects of getting to know R is knowing which function to use in a particular situation. Fortunately, R not only provides documentation for all its functions, but also ways of searching through the documentation, as well as other ways of getting help.


Getting help

There are a number of ways to get help in R, and there is also a wide variety of online information. Most installations of R come with a reasonably detailed help file called "An Introduction to R", but this can be rather technical for first-time users of a statistics package. Almost all functions and other objects that are automatically provided in R have a help page which gives intricate details about how to use them. These help pages usually also contain examples, which can be particularly helpful for new users. However, if you don't know the name of what you are looking for, then finding help may not be so easy, although it is possible to search for keywords and concepts that are associated with objects.

Some versions of R give easy access to help files without having to type in commands (for example, versions which provide menu bars usually have a "help" menu, and the Macintosh interface also has a help box in the top right hand corner). However, this functionality can always be accessed by typing in the appropriate commands. You might like to type some or all of the following into an R session (no output is listed here because the result will depend on your R system).

  help.start()            #A web-based set of help pages (try the link to "An Introduction to R")
  help(sqrt)              #Show details of the "sqrt" and similar functions
  ?sqrt                   #A shortcut to do the same thing
  example(sqrt)           #run the examples on the bottom of the help page for "sqrt"
  help.search("maximum")  #gives a list of functions involving the word "maximum", but oddly, "max" is not in there!
  ### The next line is commented out to reduce internet load. To try it, remove the first # sign.
  #RSiteSearch("maximum")  #search the R web site for anything to do with "maximum". Probably overkill here!

    
The help.search() command illustrates a problem you may come across when using the R help functions. The searching facility for help files is sometimes a bit hit-and-miss. If you can't find exactly what you are looking for, it is often useful to look at the "See also" section of any help files that sound vaguely similar or relevant. In this case, you would probably eventually find the max() function by looking at the "See also" section of the help file for which.max(). Not ideal!
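
When a keyword search misses an obvious function, searching by name is another option. The sketch below uses apropos(), which is not mentioned above: it lists objects whose names contain the given text. The exact list depends on which packages are loaded, so no output is shown.

```r
apropos("max")    #List objects whose names contain "max"; the list should include max itself
```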


Quitting R

To quit R, you can use either the function quit() or its identical shortcut, q(), which do not require any arguments. Alternatively, if your version of R has a menu bar, you can select "quit" or "exit" with the mouse.
  q()

Either way, you will be asked if you want to save the workspace image. This will save all the work you have done so far, and load it up when you next start R. Although this sounds like a good idea, if you answer "yes", you will soon find yourself loading up lots of irrelevant past analyses every time you start R. So answer "no" if you want to quit cleanly.
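
Although quit() does not require any arguments, it does accept some optional ones. As a hedged sketch, the lines below use its save argument to answer the workspace question in advance; the value "no" is one of the documented choices, but check the help page for quit on your own system.

```r
args(quit)       #Show the arguments that quit() accepts; "save" controls the workspace question
### The next line is commented out so that pasting this block does not end your session. To try it, remove the first # sign.
#q(save="no")    #Quit straight away without saving the workspace
```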


Setting up wikibooks

Before you start on the main text, we recommend that you add a few specific wikibooks preferences. The first three lines will display the examples of R commands in a nicer format. The last line gives a nicer format to figures consisting of multiple plots (known as subfigures). You can do this by creating a user CSS file, as follows.

  • Make sure you are logged in (and create an account if you do not have one already).
  • Visit your personal CSS stylesheet, at Special:MyPage/skin.css.
  • Click on "Edit this page".
  • Paste the following lines into the large edit box:
pre {padding:0; border: none; margin:0; line-height: 1.5em; }
.code .input ol {list-style: none; font-size: 1.2em; margin-left: 0;}
.code .input ol li div:before {content: "\003E \0020";}
table.subfigures div.thumbinner, table.subfigures tr td, table.subfigures {border: 0;}

[5]

  • If you know any CSS, make any alterations you like to this stylesheet.
  • Finally, save the page by clicking on "Save page".

Enough! Let's move on to the main text.

Notes

  1. These are poor attempts at statistical jokes, as you will soon find out.
  2. Depending on how you are viewing this book, you may see a ">" character in front of each command. This is not part of the command to type: it is produced by R itself to prompt you to type something. This character should be automatically omitted if you are copying and pasting from the online version of this book, but if you are reading the paper or PDF version, you should omit the ">" prompt when typing into R.
  3. If you are familiar with computer programming languages, you may be used to using the underscore ("_") character in names. In R, "." is usually used in its place.
  4. You can use either single (') or double (") quotes to delimit text strings, as long as the start and end quotes match.
  5. Note that this is a temporary hack until GeSHi supports R code, at which point Statistical Analysis: an Introduction using R/R/Syntax can be changed. The CSS code should really read:
    .pre {padding:0; border: none; margin:0; line-height: 1.5em; }
    .source-R ol {list-style: none; font-size: 1.2em; margin-left: 0;}
    .source_R ol li div:before {content: "\003E \0020";}