Statistics/Testing Data/FtestANOVA

From Wikibooks, open books for an open world

The one-way ANOVA F-test is used to determine whether there are differences between group effects. For instance, to investigate the effect of a new drug on the number of white blood cells, the drug is given to three different groups: healthy people, people with a light form of the disease under consideration, and people with a severe form of it. The analysis of variance then assesses whether the effect of the drug on the number of white blood cells differs significantly between the groups. "Significantly" refers to the fact that there will always be some difference between the groups and also within the groups; the purpose is to investigate whether the differences between the groups are large compared to the differences within the groups.

Three assumptions must be checked before calculating an F statistic: independent samples, homogeneity of variance, and normality. Independence means that there is no relation between the measurements of different subjects. Homogeneity of variance means that the variance is the same in each of the groups of the experiment (e.g., drug vs. placebo). Normality means that the measurements within each group are approximately normally distributed.

Model

The situation is modelled in the following way. The measurement of the j-th test person in group i is indicated by:

X_{ij}=\mu+\alpha_i+U_{ij}\,.

This reads: the measurement for person j in group i is the sum of a general effect \mu, an effect \alpha_i due to the group, and an individual contribution U_{ij}.

The individual, or random, contributions U_{ij}, often referred to as disturbances, are assumed to be independent and normally distributed, all with expected value 0 and standard deviation \sigma.

To make the model unambiguous, the group effects are constrained by the condition:

\sum_i\alpha_i=0\,.
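The model can be made concrete by simulating data from it. The following Python sketch is only an illustration: the function name simulate_one_way, its parameters and the numerical values are hypothetical and do not come from the text. It draws each measurement as the general effect plus a group effect plus a normally distributed disturbance, and it refuses group effects that do not sum to zero.

```python
import random

def simulate_one_way(mu, alphas, m, sigma, seed=0):
    """Draw X_ij = mu + alpha_i + U_ij with U_ij ~ N(0, sigma^2).

    alphas holds one group effect per group and must sum to zero;
    m is the number of persons in each group.
    """
    if abs(sum(alphas)) > 1e-12:
        raise ValueError("group effects must sum to zero")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        [mu + alpha + rng.gauss(0.0, sigma) for _ in range(m)]
        for alpha in alphas
    ]

# Hypothetical values: three groups, five persons each.
data = simulate_one_way(mu=7.0, alphas=[-1.0, 0.0, 1.0], m=5, sigma=0.5)
for i, group in enumerate(data, start=1):
    print(f"group {i}: {[round(x, 2) for x in group]}")
```

Setting sigma to 0 removes the disturbances, so every measurement in group i then equals \mu+\alpha_i exactly.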

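Now, a notational note: it is common practice to indicate averages over one or more indices by writing a dot in the place of the index or indices. So, for instance, with m persons in each group, the average of group i is

X_{i.} = \frac{1}{m}\sum^m_{j=1} X_{ij}

and X_{..} denotes the average over all measurements.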
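As a small illustration of the dot notation, the following Python sketch computes the group averages X_{i.} and the overall average X_{..} for a made-up data set. The data values and the function names are hypothetical and chosen only so that the arithmetic is easy to follow by hand.

```python
# Hypothetical measurements: 3 groups with 4 persons each (made-up numbers).
data = [
    [4.0, 5.0, 6.0, 5.0],      # group 1
    [7.0, 8.0, 9.0, 8.0],      # group 2
    [10.0, 11.0, 12.0, 11.0],  # group 3
]

def group_means(groups):
    """Return the averages X_{i.}: one mean per group, taken over j."""
    return [sum(g) / len(g) for g in groups]

def grand_mean(groups):
    """Return the average X_{..} over all measurements (both indices)."""
    values = [x for g in groups for x in g]
    return sum(values) / len(values)

print(group_means(data))  # [5.0, 8.0, 11.0]
print(grand_mean(data))   # 8.0
```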

The analysis of variance now splits the total variation, measured by the total "sum of squares", into two parts: one due to the variation within the groups and one due to the variation between the groups:

SST=\sum_{ij}(X_{ij}-X_{..})^2=\sum_{ij}(X_{ij}-X_{i.}+X_{i.}-X_{..})^2=
\sum_{ij}(X_{ij}-X_{i.})^2+\sum_{ij}(X_{i.}-X_{..})^2\,.
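The last equality holds because the cross term that appears when the square is expanded vanishes. Writing m for the number of persons in each group, the step can be spelled out as:

2\sum_{ij}(X_{ij}-X_{i.})(X_{i.}-X_{..})=2\sum_i(X_{i.}-X_{..})\sum_j(X_{ij}-X_{i.})=0\,,

since for every group i the inner sum is zero:

\sum_j(X_{ij}-X_{i.})=mX_{i.}-mX_{i.}=0\,.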

The within-group part is the sum of squares of error:

SSE=\sum_{ij}(X_{ij}-X_{i.})^2\,,

the total of the squared differences of the individual measurements from their group averages, which indicates the variation within the groups. The between-group part is the sum of squares of the factor:

SSA=\sum_{ij}(X_{i.}-X_{..})^2\,,

the total of the squared differences of the group means from the overall mean, which indicates the variation between the groups.
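The decomposition can be checked numerically. The sketch below uses made-up data and a hypothetical helper named sums_of_squares; it computes SST, SSE and SSA directly from their definitions and shows that the within-group and between-group parts add up to the total.

```python
def sums_of_squares(groups):
    """Return (SST, SSE, SSA) computed from the definitions."""
    means = [sum(g) / len(g) for g in groups]   # group averages
    values = [x for g in groups for x in g]
    overall = sum(values) / len(values)         # overall average
    # Total: every measurement against the overall average.
    sst = sum((x - overall) ** 2 for x in values)
    # Within groups: every measurement against its own group average.
    sse = sum((x - mean) ** 2 for g, mean in zip(groups, means) for x in g)
    # Between groups: the group average against the overall average,
    # counted once for every measurement in the group.
    ssa = sum(len(g) * (mean - overall) ** 2 for g, mean in zip(groups, means))
    return sst, sse, ssa

# Hypothetical measurements: 3 groups with 4 persons each.
data = [
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 8.0, 9.0, 8.0],
    [10.0, 11.0, 12.0, 11.0],
]

sst, sse, ssa = sums_of_squares(data)
print(sst, sse, ssa)  # 78.0 6.0 72.0
```

In this made-up example almost all of the total variation comes from the differences between the groups (72 out of 78).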

Under the null hypothesis of no effect:

H_0: \forall_i\ \alpha_i=0

we find:

SSE/\sigma^2 is chi-square distributed with a(m-1) degrees of freedom, and
SSA/\sigma^2 is chi-square distributed with a-1 degrees of freedom,

where a is the number of groups and m is the number of persons in each group. Moreover, the two sums of squares are independent of each other.
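As a supplementary remark, the expected values of the two sums of squares indicate why a large SSA relative to SSE speaks against the null hypothesis. For the model above, with m persons in each of the a groups and group effects that sum to zero:

E(SSE)=a(m-1)\sigma^2\,,

E(SSA)=(a-1)\sigma^2+m\sum_i\alpha_i^2\,.

Group effects that differ from zero therefore inflate SSA but leave SSE unaffected.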

Hence the quotient of the so-called mean sums of squares:

MSA=\frac{SSA}{a-1}

and

MSE=\frac{SSE}{a(m-1)}

may be used as a test statistic

F=\frac{MSA}{MSE}

which under the null hypothesis is F-distributed with a-1 degrees of freedom in the numerator and a(m-1) in the denominator. The unknown parameter \sigma plays no role, since it cancels in the quotient.
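The whole calculation can be sketched in a few lines of Python. The function name one_way_f and the data are hypothetical. The sketch assumes equally sized groups, as in the text, and returns the F statistic together with its two degrees of freedom.

```python
def one_way_f(groups):
    """Return (F, df_numerator, df_denominator) for equally sized groups."""
    a = len(groups)       # number of groups
    m = len(groups[0])    # number of persons in each group
    if any(len(g) != m for g in groups):
        raise ValueError("this sketch assumes equally sized groups")

    means = [sum(g) / m for g in groups]
    overall = sum(means) / a  # equals the overall average for equal sizes

    ssa = sum(m * (mean - overall) ** 2 for mean in means)
    sse = sum((x - mean) ** 2 for g, mean in zip(groups, means) for x in g)

    msa = ssa / (a - 1)        # mean square of the factor
    mse = sse / (a * (m - 1))  # mean square of error
    return msa / mse, a - 1, a * (m - 1)

# Hypothetical measurements: 3 groups with 4 persons each.
data = [
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 8.0, 9.0, 8.0],
    [10.0, 11.0, 12.0, 11.0],
]

f_stat, df1, df2 = one_way_f(data)
print(round(f_stat, 6), df1, df2)  # 54.0 2 9
```

The value of F is then compared with the upper tail of the F distribution with (a-1, a(m-1)) degrees of freedom. If SciPy is available, scipy.stats.f.sf(f_stat, df1, df2) gives the corresponding p-value; that call is not part of the sketch above, which uses only the standard library.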