# Statistical Analysis: an Introduction using R/Chapter 3

## Models

Anscombe?

- R topics: user functions & grouping using curly braces

## The Role of Assumptions

Imagine that we have some observations, and we want to use them to conclude something about the world around us. Statistics can help us in the common case when the observations are composed of a systematic component combined with random chance effects. A classic "toy" example is that of dice. Let us say that we wish to analyse the following result, obtained by rolling a die five times.

| Rolls |
|---|
| 2 |
| 3 |
| 6 |
| 6 |
| 6 |

Although as a general rule, your first step should be to plot your data, there is little point in this instance. The dataset is so small that we can get a feel for it by just inspecting the numbers. The main striking feature is that we seem to have a preponderance of 6's, although this could, of course, be due to chance.
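For a dataset this small, a quick tabulation can stand in for a plot. The following is a minimal sketch, in which the variable names `rolls` and `counts` are illustrative rather than taken from the text, that counts how often each face of the die appears:

```r
# The five observed rolls from the table above
rolls <- c(2, 3, 6, 6, 6)

# Count occurrences of every face from 1 to 6, including faces never seen.
# Converting to a factor with explicit levels keeps the zero counts.
counts <- table(factor(rolls, levels = 1:6))
print(counts)
# The counts for faces 1 to 6 are 0 1 1 0 0 3
```

Three of the five rolls are sixes, which is the "preponderance of 6's" noted above.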

We could just treat the sequence {2,3,6,6,6} as a unique, arbitrary sequence of events. But this is rather pointless: data is usually analysed in order to seek general patterns and, by generalizing, to increase our understanding. In this case, we might wish to use the result to decide whether the die is loaded in favour of sixes, and therefore whether it can be relied upon in a game.

Naively, we might hope to analyse this with a completely open mind; to approach the situation with no prior assumptions. A moment's thought should reveal this is impossible. For example, imagine that the die was a fraudster's dream: a sophisticated miniature machine that could be pre-programmed to give a particular sequence of numbers each day. It could, for example, be programmed to give a 2, followed by a 3, then three 6's, and spend the rest of the day rolling 1's. In this case, the results tally exactly with what we have observed, but they do not tell us anything about the subsequent behaviour of the die. Although it explains the data perfectly, most people would (quite reasonably) adopt the prior assumption that the "miniature machine" explanation was highly unlikely.

This is, of course, an extreme example, but it illustrates the point. Whether or not we realise it, we *always* examine data with prior notions of what is a reasonable explanation and what is not. Statistical analysis is the process of formalising these explanations, then using the data to choose between them. A good way to do this is by describing the **assumptions** that we have made in each case. For example, the following two assumptions are common to nearly all explanations that we might want to test.

- The Assumption of Honesty
- We assume that the data has been collected and reported “honestly”. This might not be the case if the data has been deliberately altered, or certain values have been “censored”. In our toy example, this might happen if, say, the first four rolls were all 1's but the observer discarded these because this seemed an “unusual pattern”. Although dishonesty might seriously alter our conclusions, all we can do is to assume it has not taken place.

- The Assumption of Random Error
- We assume that the data have been affected by a process of “random chance”, causing the results to vary from one instance to the next in an unpredictable manner. In the case of the die, small changes in the way in which it is thrown and its subsequent tumbling and contact with the surface combine to produce one of six outcomes. Statisticians often (somewhat confusingly) refer to this process of chance as the source of “error” in the data. Describing the way in which chance works usually entails a whole set of other assumptions, which is the focus of **Chapter 3**.

### Making the assumptions clear

The problem with making any assumptions is that they are just that: assumptions. They may or may not be true. When trying to convince others with our analysis, we are asking them to take our assumptions on trust. For this reason we should try to make widely accepted assumptions and, more importantly, ensure they are completely explicit. That way, others can decide for themselves if the analysis is to be trusted. We can encapsulate this most easily and concisely by formulating a **model** of the underlying process.

## Modelling reality

A common way of understanding the world around us is to describe it in terms of a **model**. The more

Many basic statistics books teach simple tests, such as the t-test or sign-test. These are all based on an underlying model.

So what does an appropriate model look like? There are various ways in which we can

### Some verbal models

Testing a particular model.

- Model 1 — a completely biased die
- the die always gives a 6 when thrown. No extra assumptions are needed here, and in this extreme case, there is no error process.

We can easily disprove this by a single observation. However, we can never prove it. This turns out to be generally true. It is impossible to prove that something is the case, because there could always be a

- Model 2 — a fair die
- Here is a more complicated model, with the following assumptions
- Each roll has 6 possible outcomes, with a number from 1 to 6 selected each time. This is sampling "with replacement". This assumption defines the set of possibilities (the sample space).
- The assumption of independence: one roll does not tell us anything extra about what will happen on a subsequent roll.
- The assumption of homogeneity: the chance of any outcome is the same for each roll.
- Fairness: there is an equal chance of any of the 6 numbers being chosen, each number being equally probable.
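The two verbal models can be written down as sets of probabilities, one per face. The sketch below is only an illustration under the assumptions listed above: the names `faces`, `p_fair`, `p_biased` and `observed` are hypothetical, and multiplying the per-roll probabilities is justified only if independence and homogeneity hold.

```r
faces <- 1:6                      # the sample space: six possible outcomes

# Model 2 (fair die): every face is equally probable
p_fair <- rep(1 / 6, 6)

# Model 1 (completely biased die): a 6 is certain, everything else impossible
p_biased <- c(0, 0, 0, 0, 0, 1)

observed <- c(2, 3, 6, 6, 6)      # the rolls we want to explain

# Under independence and homogeneity, the probability of the whole
# sequence is the product of the probabilities of the individual rolls.
p_seq_fair   <- prod(p_fair[observed])    # (1/6)^5, i.e. 1 in 7776
p_seq_biased <- prod(p_biased[observed])  # 0: a single non-six rules Model 1 out

print(p_seq_fair)
print(p_seq_biased)
```

The zero under Model 1 is the formal version of "we can easily disprove this by a single observation", whereas the fair-die model assigns the observed sequence a small but non-zero probability.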

If the model contains an element of chance, how can we know whether ***

Once we have our models, we could either

- try to disprove this model (although, because in this case there is always a slight chance of 6 in a row, we can never completely disprove it)
- compare the models in order to find out which is better

The simplest is simulation (compare to likelihood)

### Simulating a model

One of the major ways in which we can use models is simulation. This will be a major way in which models are explored in this book. To do so, we need to convert the various models described above into simulations. The "fair die" model above provides a good, simple example. We will convert this model to a simulation in R. This involves learning a little about how R deals with numbers, so you should check that you are comfortable with the idea of **functions** in R, as described previously.

Sampling with replacement - describe here the idea of using random sampling for simulation

#### Random sampling

For the `sample()` function used below, *x* must be a vector of items and *size* must be a number. Since `1:6` gives a vector of the numbers from 1 to 6, we can set `x=1:6` and `size=5`. Here are 5 examples (note that the first 4 are equivalent, although the actual result will differ due to chance effects when sampling[1]).

###### Input:

    ### The next 4 lines are equivalent, 5 numbers are selected from a list of 1..6
    sample(x=1:6, size=5, replace=FALSE)  # when sampling WITHOUT replacement, each number only appears once
    sample(replace=FALSE, size=5, x=1:6)  # you can change the order of the arguments
    sample(x=1:6, size=5)                 # the same, because replace=FALSE by default
    sample(1:6, 5)                        # we don't need x= and size= if arguments are in the same order as in the help file
    ### The next line is a different model
    sample(1:6, 5, TRUE)                  # sampling WITH replacement (the same number can appear twice)

###### Result:

    > ### The next 4 lines are equivalent, 5 numbers are selected from a list of 1..6
    > sample(x=1:6, size=5, replace=FALSE)  # when sampling WITHOUT replacement, each number only appears once
    [1] 1 5 4 3 6
    > sample(replace=FALSE, size=5, x=1:6)  # you can change the order of the arguments
    [1] 5 6 4 2 1
    > sample(x=1:6, size=5)                 # the same, because replace=FALSE by default
    [1] 2 3 4 6 5
    > sample(1:6, 5)                        # we don't need x= and size= if arguments are in the same order as in the help file
    [1] 1 6 3 5 4
    > ### Now simulate a different model
    > sample(1:6, 5, TRUE)                  # sampling WITH replacement (the same number can appear twice)
    [1] 3 6 2 1 3

sample(1:6, 5, TRUE)
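Because `sample()` draws at random, repeated calls usually give different answers. The footnote's suggestion of calling `set.seed()` can be illustrated with the following minimal sketch. The variable names are hypothetical, and the particular numbers drawn depend on your version of R, so none are shown here.

```r
# Fix the random number generator's starting point, then roll five times
set.seed(1)
first_draw <- sample(1:6, 5, replace = TRUE)

# Resetting the same seed reproduces exactly the same "random" rolls
set.seed(1)
second_draw <- sample(1:6, 5, replace = TRUE)
print(identical(first_draw, second_draw))   # TRUE

# Sampling WITHOUT replacement can never repeat a number,
# so it is not a sensible model for rolling a die
no_repl <- sample(1:6, 5)
print(anyDuplicated(no_repl) == 0)          # TRUE
```

Only the version with `replace = TRUE` matches the fair-die model, since a real die can show the same face on more than one roll.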

## References

1. Call `set.seed(1)` before each chapter to get exactly the same results.

## Testing Models

We can try to disprove a particular model, or select between different models using some informed judgement.

Various ways to test a model. E.g. compare results from the simulation with the observed ones

We now have a simple method of simulating data produced by the model. How can we

Now that we can simulate How do we ***. We are unlikely to get exactly the sequence we observed. A classic method is to use a **sample statistic**. If we got 3 fives or 3 ones we would also be surprised. Link to idea of probability space.

The sample statistics

#### Combining functions

###### Input:

tabulate(c(2,3,6,6,6)) #an example: we can see that the

max(tabulate(c(2,3,6,6,6))) #simply confirms what

###### Result:
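For the observed rolls, `tabulate()` returns the counts 0 1 1 0 0 3 for faces 1 to 6, and `max()` of those counts is 3. As a hedged sketch of how the two functions might be combined into a reusable sample statistic, the user function below groups the two steps inside curly braces. The name `max_count` and the `nbins = 6` argument are choices made here for illustration, not part of the original text.

```r
# A user-defined function: the largest number of times any one face appears.
# The curly braces group the two statements into a single function body.
max_count <- function(rolls) {
  counts <- tabulate(rolls, nbins = 6)  # always count all six faces
  max(counts)                           # the last value is returned
}

observed <- c(2, 3, 6, 6, 6)
print(tabulate(observed, nbins = 6))    # 0 1 1 0 0 3
print(max_count(observed))              # 3
```

Using `nbins = 6` guards against a sample that happens to contain no sixes, although the maximum count is the same either way. The statistic treats three fives or three ones as just as surprising as three sixes.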

#### Simulations using replicate()

###### Input:

replicate(1000, max(tabulate(sample(1:6, 5, TRUE))))

###### Result:

    > replicate(1000, max(tabulate(sample(1:6, 5, TRUE))))
      [1] 3 2 2 2 2 3 2 2 3 2 2 2 2 2 3 2 2 2 3 1 2 2 3 2 2 2 2 2 2 2 2 2 2 3 2
     [36] 1 3 4 2 2 2 2 2 3 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2
     [71] 1 3 2 2 2 2 1 2 3 2 2 2 3 2 2 3 2 2 2 3 2 2 3 2 2 1 2 2 2 3 2 2 1 2 3
    [106] 3 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 2 2 2 3 3 2 1 2
    [141] 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2
    [176] 2 3 2 2 2 2 2 2 2 2 2 2 4 2 4 2 2 2 1 2 2 2 2 3 2 3 3 2 2 2 2 2 3 2 2
    [211] 2 3 2 2 2 2 2 2 3 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 1
    [246] 2 2 3 2 3 2 2 3 2 3 2 1 2 2 2 1 2 2 2 2 3 2 3 2 2 2 3 2 3 2 2 2 2 2 2
    [281] 2 2 2 2 3 2 2 2 3 2 2 3 2 2 1 2 2 1 3 3 2 2 2 2 2 2 3 2 2 2 3 2 2 2 3
    [316] 4 3 2 1 1 3 2 2 2 3 3 1 3 2 2 1 2 4 2 3 2 2 2 1 2 2 2 2 2 2 3 2 1 2 2
    [351] 1 3 2 2 3 2 2 2 2 3 1 4 2 3 3 3 2 4 3 2 2 1 2 2 2 2 2 2 3 2 2 1 2 3 2
    [386] 3 2 2 4 2 2 2 1 1 2 3 3 3 2 2 2 2 2 3 2 2 1 2 3 1 2 2 2 2 2 2 2 2 3 2
    [421] 2 1 2 3 2 2 2 2 1 3 2 2 2 2 2 3 1 1 2 2 2 2 2 2 3 2 2 2 3 3 2 2 2 3 2
    [456] 2 1 2 2 2 2 2 2 3 2 2 3 1 3 2 2 3 2 3 2 2 2 2 1 2 3 2 3 2 2 3 2 4 2 2
    [491] 3 2 2 3 2 2 2 4 3 1 2 2 3 2 2 2 2 2 2 4 1 2 2 1 2 2 2 2 2 3 1 2 2 2 2
    [526] 3 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 3 1 1 2 3 2 2 2 2 2 2
    [561] 4 2 1 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 2 2 1 2 3 2 2 2 3 2 2 2 2 2 2 2 2
    [596] 2 2 3 2 2 2 3 2 3 2 2 2 2 2 1 2 2 3 3 3 2 2 2 2 2 2 3 2 4 1 2 2 2 2 2
    [631] 3 2 2 2 2 2 2 2 3 2 2 3 3 2 2 1 1 2 2 3 2 4 2 1 2 2 1 2 2 2 2 2 2 3 2
    [666] 2 2 2 3 2 2 2 3 2 2 2 2 2 2 1 2 2 2 2 2 2 2 3 2 2 3 3 2 2 2 3 3 2 2 3
    [701] 3 3 2 2 2 2 2 1 2 2 3 2 2 2 2 3 2 3 2 3 2 1 2 2 2 2 2 2 3 2 1 2 2 2 2
    [736] 3 3 2 3 2 2 2 3 2 2 2 1 2 2 2 3 2 3 3 2 2 3 1 2 2 2 2 2 4 2 2 2 2 2 2
    [771] 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 2 3 3 3 2 2 2 2 2 2
    [806] 2 2 3 1 2 2 4 2 2 1 4 2 3 3 2 2 2 3 2 1 2 2 3 2 2 2 1 2 2 2 2 2 2 2 1
    [841] 2 2 1 2 1 2 3 2 2 2 3 3 2 3 1 2 2 2 2 2 2 3 3 2 2 2 3 2 2 2 2 1 1 2 2
    [876] 1 1 2 2 2 3 1 2 2 2 1 2 2 2 2 2 2 2 2 2 3 2 2 2 1 2 3 2 2 3 2 1 2 3 1
    [911] 3 2 3 2 3 3 1 2 2 2 3 2 1 2 2 2 2 2 3 4 2 2 2 2 3 2 2 2 4 4 2 1 1 2 2
    [946] 3 3 2 2 3 2 2 2 3 2 1 2 2 2 2 2 1 2 2 2 2 2 1 2 3 1 3 4 3 2 2 2 2 2 2
    [981] 1 2 2 2 2 2 3 1 3 3 2 2 3 2 2 4 2 2 4 2

You should be able to see that there are a smattering of 1’s,

Prelude to introducing the concept of probability.
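A thousand numbers are hard to take in by eye, so one possible next step is to summarise them. The sketch below is an illustration only, with hypothetical names (`max_count`, `sims`, `prop_as_extreme`). It repeats the fair-die simulation with a fixed seed, tabulates the simulated statistics, and asks how often the fair-die model produces a result at least as extreme as the three-of-a-kind we observed. The exact counts will vary with the seed and R version, so no output is quoted.

```r
set.seed(1)  # only so that repeated runs give the same simulated values

# Sample statistic: the largest number of times any one face appears
max_count <- function(rolls) {
  max(tabulate(rolls, nbins = 6))
}

# Simulate 1000 sets of five rolls under the fair-die model
sims <- replicate(1000, max_count(sample(1:6, 5, replace = TRUE)))

# How many simulations gave a maximum count of 1, 2, 3, ...?
print(table(sims))

# Compare with the observed data: what proportion of fair-die simulations
# show at least as many repeats as we actually saw?
observed_stat <- max_count(c(2, 3, 6, 6, 6))
prop_as_extreme <- mean(sims >= observed_stat)
print(prop_as_extreme)
```

If this proportion is not especially small, the observed rolls are not strong evidence against the fair die on their own. Turning that informal judgement into something more precise is where the concept of probability comes in.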