Statistical Analysis: an Introduction using R/Chapter 1
Figure 1.1 shows one of the standard sets of data available in the R statistical package. In the 1920s, braking distances were recorded for cars travelling at different speeds. Analysing the relationship between speed and braking distance can influence the lives of a great number of people, via changes in speeding laws, car design, and the like. Other datasets, for example concerning medical or environmental information, have an even greater impact on humans and our understanding of the world. But real-world data are often "messy", as shown in the plot. Most people looking at the plot would be happy to conclude that speed and stopping distance are linked in some way. However, this cannot be the whole story because even at the same speed, different stopping distances were recorded.
This combination of intuitively sensible patterns and messiness is perhaps one of the most common scientific results. Statistical analysis can help us to clarify and judge the patterns we think we see, as well as revealing, out of the mess, effects that may be otherwise difficult to discern. It can be used as a tool to explain the world around us, and perhaps equally importantly, to convince others of the correctness of our explanations. This emphasis on informed judgement and convincing others is important: ideally, statistics should aid a clear explanation, rather than blind by force of argument.
A common misconception of statistical analysis is that it is necessarily technical and "hard to understand". Indeed "I don't understand statistics" is a cry frequently heard. But if an analysis can only be understood by statisticians, it has, to a great extent, failed. For an analysis to convince an audience, one or more reasonable explanations for a situation should be carefully and comprehensibly laid down. Statistical analysis of data (which have, ideally, been collected to test these explanations) can then be used to justify a conclusion that can be generally agreed upon. As a rule, the simpler (the more parsimonious) the explanation, the more it is to be preferred. Hence a good explanation is one which is sensible and comprehensible to an informed observer, and which makes clear predictions, but is as simple as possible.
This book is intended to teach you how to formulate and test such "explanations" in a statistical manner. This is the basis of statistical analysis.
The statistical method
As with the "scientific method", to which it bears a number of similarities, it is open to debate whether or not there exists a universal "statistical method". Nevertheless, most statistical analyses involve questioning some aspect of the world, coming up with potential answers to these questions and then formalising these explanations by proposing statistical models that may account for particular aspects of the data.
Hence the major parts of an analysis are:
- Deciding the questions you wish to tackle (or more generally, the aim of the analysis)
- Proposing some reasonable explanations which have the potential to answer these questions
- Formalising these explanations as statistical models
- Collecting appropriate data
- Testing the models against the data, using more or less technical methods.
For example ***
Hence the majority of analyses that you are likely to perform assume, either explicitly or implicitly, an underlying "statistical model". So understanding statistical models, and what they entail, is fundamental to understanding statistics.
Chapter 3 discusses statistical models in greater detail, but it is worth introducing some general concepts here. Models provide some predictive ***. What distinguishes a statistical model is that there is also an element of uncertainty in the process. Hence statistical models consist of two components: one which is fixed, and one which captures the uncertainty. This is
As an example, we can return to the situation depicted in Figure 1.1. A point to which we will return many times is that a good understanding of the background, or context, of the data is essential to any analysis. In this case, it is important to know that the data concern the speeds and braking distances of cars. Our general knowledge about driving can then guide our choice of model. In particular, it seems reasonable to model a situation in which speed has an effect on braking distance (rather than, for example, the other way around). It also seems reasonable to treat speed as something that is determined by other factors, and outside the scope of our analysis ***.
Hence an initial, reasonable model might assume that the braking distance is determined by speed, plus some sort of random error. To construct an imaginary set of braking distances given a set of speeds, we need to be more precise. In particular, we need to specify how speed affects distance, and what the error looks like***.
Later on we'll see how to specify this mathematically. For the moment, we'll do it graphically and by simulation.
It is common to assume "Normal" error ****
e.g. maybe there is no effect at all: there is a fixed braking distance regardless of speed, and the only reason why distances vary is chance error. => graphically show
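The idea of constructing an imaginary set of braking distances can be sketched by simulation. The following is only a minimal illustration, assuming Normal error: the slope of 4 feet per mile-per-hour, the error standard deviation of 15 feet, and all object names are assumptions made for this sketch, not values taken from the text.

```r
# Simulate imaginary braking distances under two candidate models.
# All parameter values below are illustrative assumptions.
set.seed(1)                # make the simulation repeatable
speed <- cars$speed        # reuse the recorded speeds (mph)
n <- length(speed)

# Model 1: no effect of speed. A fixed distance plus Normal error.
fixed_dist <- mean(cars$dist)
sim_null <- fixed_dist + rnorm(n, mean = 0, sd = sd(cars$dist))

# Model 2: distance rises by a fixed amount per extra mph, plus Normal error.
slope <- 4                 # assumed: feet per mph
error_sd <- 15             # assumed: spread of the chance error, in feet
sim_linear <- slope * speed + rnorm(n, mean = 0, sd = error_sd)

# Show the observed data next to one simulated data set from each model.
old_par <- par(mfrow = c(1, 3))
plot(speed, cars$dist, main = "Observed", ylab = "distance (ft)")
plot(speed, sim_null, main = "Simulated: no effect", ylab = "distance (ft)")
abline(h = fixed_dist, col = "red")        # the fixed part of model 1
plot(speed, sim_linear, main = "Simulated: linear", ylab = "distance (ft)")
abline(a = 0, b = slope, col = "red")      # the fixed part of model 2
par(old_par)
```

Rerunning the code with a different seed gives a different imaginary data set from the same model, which is one way to get a feel for how much of a pattern chance alone can produce.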
we need to be more precise than this: we
one of the most important things to It is reasonable to 1.2(a-d). In each of the four figures, a different statistical model has been proposed: the predictive part of each model is shown as a red line, and the uncertainty (the fluctuations attributed to chance or error) shown as light blue lines. In all cases, the model has assumed that the uncertainty is all in the effect of braking distance (the light blue lines are all vertical).
To show the uncertainty on its own, the right hand plot in each figure shows the residuals ***.
- The straight (orange) line. A linear relationship is perhaps the simplest way that a relationship between , which is why ***. The simplest explanation for the pattern is that stopping distance is directly proportional to speed, so that for every extra mile-per-hour, stopping distance increases by a fixed amount: graphically, this represents a straight line on the plot. It is common to .
- The curved (red) line. However, this has not taken into account any : perhaps . As remarked before, this is not enough to explain all the data: It may be less clear that lowess is a model ****.
The "cars" example highlights the importance of prior knowledge when doing statistics. Data should not be treated as pure "numbers", but considered in context. Here, because we know that the data are the outcome of a relatively well-understood physical system, we should be led to ask if a particular relationship is expected, which then encourages us to use a simple quadratic fit.
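As a rough sketch of what comparing a straight line with a "simple quadratic fit" might look like in R, both models can be fitted to the built-in cars data with lm(). The particular quadratic form used here (a speed term plus a speed-squared term) is an assumption about what is meant; the object names are hypothetical.

```r
# Fit a straight line and a quadratic curve to the cars data.
fit_line <- lm(dist ~ speed, data = cars)
fit_quad <- lm(dist ~ speed + I(speed^2), data = cars)  # assumed quadratic form

# Estimated coefficients of each model.
coef(fit_line)
coef(fit_quad)

# Plot the data with both fitted relationships.
plot(dist ~ speed, data = cars,
     xlab = "speed (mph)", ylab = "stopping distance (ft)")
abline(fit_line, col = "orange")
speed_grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed),
                                     length.out = 100))
lines(speed_grid$speed, predict(fit_quad, newdata = speed_grid), col = "red")

# Residual sum of squares: how much variation each model leaves unexplained.
rss <- c(line = sum(resid(fit_line)^2), quadratic = sum(resid(fit_quad)^2))
rss
```

The quadratic model can never leave more unexplained variation than the straight line, because it contains the straight line as a special case. Whether the improvement justifies the extra complexity is exactly the question of parsimony raised earlier.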
Thus in any analysis it is vital to understand the data: to know where they come from, how they relate to the real world, and — by plotting or other forms of “exploratory data analysis” — what they look like. Indeed, plotting data is one of the most fundamental skills in a statistical analysis: it is a powerful tool for revealing patterns and convincing others. If necessary this can then be backed up by more formal, mathematical techniques. The use of graphics in this book is twofold: not only can they be used to explore data, but formal statistical methods can often be understood and pictured graphically, rather than by heavy use of mathematics. This is one of the principal aims of this book.
The Scientific Method
The statistician cannot excuse himself from the duty of getting his head clear on the principles of scientific inference, but equally no other thinking man can avoid a like obligation.—R. A. Fisher
This book focuses on the important role of statistics in the scientific method. Fundamentally, science involves carefully testing a number of rational explanations, or "hypotheses", that purport to explain observed phenomena. Usually, this is done by proposing plausible hypotheses and then trying to eliminate one or more of them by careful experimentation or data collection (this is known as the hypothetico-deductive method). This means that a basic requirement for a scientific hypothesis is that it can be proved wrong: that it is, in Popper's words, "falsifiable". In this chapter, we will see that it is logically impossible to "prove" that a hypothesis is correct; nevertheless, the more tests that a hypothesis passes, the more we should be inclined to believe it.
Ideally, scientific research involves a repeated process of generating hypotheses, eliminating as many as possible, and refining the remaining ones, in a process known as "strong inference" (cite Platt). There are two very different steps involved here: a rather speculative one in which hypotheses are produced or refined, and a strictly logical one in which they are eliminated.
These two steps have their counterparts in statistical analysis. The branch of statistics concerned with hypothesis testing is designed to identify which hypotheses seem improbable, and hence may be eliminated. The branch of statistics concerned with exploratory analysis is designed to identify plausible explanations for a set of data. While we will begin our discussion of these techniques separately, it should be emphasised that in practice, the distinction is not so clear-cut. These two branches of statistical practice are better seen as extremes of a continuum of techniques available to the researcher. For example, many hypotheses involve numerical parameters, such as the slope of a best fit line. Statistical estimation of such parameters may be seen as a test of a hypothesis, but may also be seen as a suggestion for a refined or even novel explanation of the facts.
To properly test a hypothesis, the right sort of data need to be collected. In fact, the targeted collection of data and (where possible) the careful design of experiments, is probably the most important process in science. This need not be difficult. For example, imagine that our hypothesis is that (for genetic reasons) a person cannot have both blond hair and brown eyes. This hypothesis can be disproved by a single observation of a person with that combination of features.
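| Hair \ Eye | Brown | Blue | Hazel | Green |
|------------|-------|------|-------|-------|
| Black      | 68    | 20   | 15    | 5     |
| Brown      | 119   | 84   | 54    | 29    |
| Red        | 26    | 17   | 14    | 14    |
| Blond      | 7     | 94   | 10    | 16    |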
Table 1.1 shows the results of a 1974 survey of students from the University of Delaware. As with any test, we need to make a few assumptions about these data, for example that students with dyed hair have not been included, or are listed under their original hair colour. If this is the case, then we can see immediately that we can reject the hypothesis that blond hair and brown eyes cannot occur together: there were 7 students with brown eyes and blond hair.
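The same check can be made directly in R, where these data are available as the built-in HairEyeColor table. This is a small sketch: margin.table() is one of several ways to sum over sex, and the object names are hypothetical.

```r
# HairEyeColor is a three-way table: Hair x Eye x Sex.
# Sum over Sex to get the two-way hair-by-eye table.
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))
hair_eye

# The hypothesis "blond hair and brown eyes cannot occur together"
# is rejected if this count is greater than zero.
blond_brown <- hair_eye["Blond", "Brown"]
blond_brown > 0
```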
Imagine, however, if the survey had not revealed any students with brown eyes and blond hair. Although this would be consistent with our hypothesis, it would not be enough to prove it correct. It could be that brown eyes and blond hair are merely very rare, it being pure chance that we did not see any. This is a general problem. It is impossible to be sure that a hypothesis is correct: there may always be another, very similar explanation for the same observations.
However, as seen in this example, it is possible to reject hypotheses. For this reason, science relies on eliminating hypotheses. Often, therefore, scientists construct one or more hypotheses which they believe not to be the case, purely in order to be rejected. If the aim of a study is to convince other people of one particular theory or hypothesis, then a good way to do so is to define hypotheses which encompass as many reasonable alternative explanations as can be envisaged. If all of these can be disproved, then the remaining, original hypothesis becomes much more convincing.
The null hypothesis
In most scientific observations, there is an element of chance, and so the most important hypothesis to try to eliminate – at least initially – is that the observed data are due to chance alone. This is usually known as the null hypothesis.
In our initial example, the null hypothesis is relatively obvious: it is that there is no association between hair and eye colour (any seeming associations are purely due to chance). But constructing an appropriate null hypothesis is not always so easy. Here are three examples which we will investigate later, ranging from the simple to the highly complex.
- The sex-ratio among children born in a hospital (e.g. on 18 December 1997 in the Mater Mother's Hospital in Brisbane, Australia, a record 44 children were born, of which 18 were female). A reasonable null hypothesis might be that each child is equally likely to be male or female. Since it is known that humans usually have a male-biased sex ratio, a different null hypothesis (say 51% males) may be more reasonable.
- The relationship seen in Figure 1.1 between speeds and stopping distances in cars. A reasonable null hypothesis might be that there is no relationship between the speed of a car and its stopping distance (no association between x and y). However, this null hypothesis is more difficult to model, because the error distribution is unknown. One way of doing it is to take the ranks of x and the ranks of y; another is to sample from the observed values.
In both cases, we need to *model* the null hypothesis: what would we expect if
- Deaths and serious injuries in cars in the UK from 1969-1985. A more complex example, and one that has affected most of the population in the UK, is the introduction of compulsory wearing of seatbelts in cars: a law that came into force on 31st January 1983. Figure 1.2 shows that Seatbelts ****. This requires a more complex null model: if we fit a straight line, we need to make some assumption about variation from the line, or we can take the actual values as representative of the variation. The null model also involves other factors (e.g. petrol price).
We can model the null hypothesis as long as we have enough detail. **what do you need for different egs**. Because there is random error, we need to do this lots of times. We will see that a lot of statistics relies on rejecting the null hypothesis based on how often it gives results like those observed.
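Modelling a null hypothesis "lots of times" can be sketched as follows for the first two examples. This is an illustration only: the number of simulations, the use of a rank (Spearman) correlation as the summary of association, and all object names are assumptions made for the sketch.

```r
set.seed(42)     # make the simulations repeatable
n_sims <- 2000   # assumed number of repeats

# Example 1: sex ratio. Null model: each birth is a girl with probability 0.5.
n_births <- 44
n_girls <- 18
sim_girls <- rbinom(n_sims, size = n_births, prob = 0.5)
# How often is a simulated count at least as far from 22 as the observed 18?
p_sex <- mean(abs(sim_girls - n_births / 2) >= abs(n_girls - n_births / 2))
p_sex

# Example 2: cars. Null model: no association between speed and distance.
# Shuffling the distances breaks any link with speed while keeping
# the observed values themselves.
obs_cor <- cor(cars$speed, cars$dist, method = "spearman")
perm_cor <- replicate(n_sims,
                      cor(cars$speed, sample(cars$dist), method = "spearman"))
# How often does chance alone give an association as strong as that observed?
p_cars <- mean(abs(perm_cor) >= abs(obs_cor))
p_cars
```

In each case the final proportion measures how often the null model produces results like those observed: a large value means that chance alone is a perfectly adequate explanation, while a very small one suggests the null hypothesis can be rejected.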
The same goes for others, e.g. seatbelt??? Comparing models
Types of error
Hypotheses may not be a simple yes/no matter, but more sophisticated, e.g.
what sex ratio is suggested by the data? (that's easy, but how confident are we about the accuracy of this estimate?)
cars: we are convinced that there is a straight-line relationship between speed & braking distance: what is the slope of the line (but maybe physics would suggest a different relationship - straight line is usually the default) - already here we are making some sort of choice of model.
MLE brief description: "if the model is correct, what is the most likely value for these parameters?"
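For the sex ratio example, the question of the most likely parameter value can be sketched numerically. This assumes a binomial model for the 18 girls among 44 births; the function and object names are hypothetical, and binom.test() is used here simply as one convenient way to express confidence in the estimate.

```r
n_births <- 44
n_girls <- 18

# Log-likelihood of a proportion p of girls under a binomial model.
log_lik <- function(p) dbinom(n_girls, size = n_births, prob = p, log = TRUE)

# Search for the value of p that makes the observed data most likely.
mle <- optimize(log_lik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
mle

# For this model the answer agrees with the simple observed proportion.
n_girls / n_births

# How confident are we? A 95% confidence interval for the proportion.
sex_ci <- binom.test(n_girls, n_births)$conf.int
sex_ci

# For the cars data, the slope of a straight-line model and its interval.
fit_line <- lm(dist ~ speed, data = cars)
slope_ci <- confint(fit_line)["speed", ]
slope_ci
```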
An infinite number of models are out there. In combination with good understanding and /parsimony/, can be used to construct hypotheses & *models*. D.F?
E.g. are there better ideas than a straight line fit?
Is this at all sensible (e.g. if cannot go <0)
Outliers? ???residuals (probably not)
what about interactions? Titanic sex vs. class?
Use colours to pick out types
Human eye good at spotting patterns (but ... even if there are none) . E.g. time series
even if we have a prediction (model), how well does it fit the assumptions?
Which are important (as opposed to significant) factors?
Should have enough background to describe picture from []
How wide can we throw conclusions
Is each experiment only one data point, etc?
- Actually, the details are slightly more complex, depending on whether there is a default location to install the packages, see
- Not enough of R has yet been introduced to explain fully the commands used for the plots in this chapter. Nevertheless, for those who are interested, for any plot, the commands used to generate it are listed in the image summaries (which can be seen by clicking on the image).
- Unfortunately, details of the bewildering array of arguments available (many of which are common to other graphics-producing routines) are scattered around a number of help files. For example, to see the options for plot() when called on a dataset, see ?par. To see the options for plot() when called on a function, see ?plot.function. The numbers given to the pch argument, specifying various plotting symbols, are listed in the help file for points() (the function for adding points to a plot): they can be seen via
- from Snee (1974) The American Statistician, 28, 9–12. The full reference can be found within R by typing ?HairEyeColor. The table here has been aggregated over Sex as in
- when the hypothesis is simple and there are only small amounts of data like this, presenting it in tabular form is often just as useful as plotting
- this could either be truly random, or due to factors about which we are ignorant
- see []