Statistics Ground Zero/Association
We use the term association to indicate that two variables are not independent of each other, so we expect that a change in the value of one variable will be associated with a change in the other. We do not assume that a relationship between two variables reflects causation. This last point has become something of a cliché, and it is worth pointing out that while correlation does not prove causation, it can be a pretty strong hint.
In a sense, association is one of the most basic and common statistical observations. In hypothesis testing we look at the association between two variables, one treated as the independent variable (which we are free to modify) and the other as the dependent variable (which we observe for changes in value).
So, for example, in the case of scalar variables, the Pearson correlation coefficient is the common measure of association, while for categorical data we can test for association using Pearson's Chi-Square and indicate its strength using Cramer's V or, for binary variables, the Phi Coefficient.
Correlation measures the strength of association between two variables. We will first consider the relationship between two scalar variables and then between ranked variables.
Pearson correlation coefficient
Pearson's R indicates the strength and direction of association between two scalar variables. It ranges from -1, which indicates a perfect inverse relationship, to 1, which indicates a perfect direct relationship. At 0 we say there is no linear correlation. Pearson's R measures the linear dependence of one variable on another. Linear dependence is the degree to which one variable can be computed from the other by a linear equation (which will be explained in slightly more detail under regression below).
The significance of R is reported as a p value: the probability of obtaining a sample correlation at least as far from 0 as the one observed, if the null hypothesis R=0 were true. The test statistic computed from R is distributed approximately as t with n-2 degrees of freedom, where n is the number of paired observations.
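To make the calculation concrete, here is a minimal sketch in Python that computes Pearson's R from deviations about the means and then the t statistic commonly used to test R=0. The function names and the five data points are illustrative assumptions, not the measurements used in this chapter.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing R = 0, with n - 2 degrees of freedom.

    Undefined when r is exactly 1 or -1 (division by zero).
    """
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical illustrative data, not taken from the tables in this chapter
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
t = t_statistic(r, len(x))
```

In practice your statistics package reports R and its p value directly; the sketch only shows where the numbers come from.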
Consider the following measurements of baby height against age:
|Age (months)||Height (cm)|
This data can be visualised in this graph
where we can see that there is a strong direct relationship between age in months and height in centimeters. Indeed for this data the correlation coefficient value is 0.99 (rounded to two decimal places).
An inverse correlation
In the following data we see that as the number of widgets rises, the price per 100 widgets falls. This is an inverse correlation and has a negative value for Pearson's R.
|No of widgets||Price per 100|
This is visualised in the following graph
For this data the correlation coefficient has a value of -1.
Spearman correlation coefficient (ranked data)
The Spearman Correlation Coefficient (ρ - usually pronounced as the English word row - boats not arguments) is the counterpart to Pearson's R for ordinal data. The coefficient expresses the degree of association between two ordinal variables. Two variables with a positive relationship, in this case with ρ=1, produce the following plot
|Above we read that Pearson's R measures the linear dependence of x on y, that is the degree to which a linear function represents the relationship. Spearman's ρ can be interpreted as the extent to which a monotonic function represents the relationship between x and y. For our current purposes, a monotonic function is one that never changes direction in the y axis - that is there are no bumps or dips in the graph.|
We see that rather than a straight line this produces a curve which while it may 'plateau', never reverses its direction of movement on the y axis. A score of -1 would produce a mirror image of this in the Y axis.
Although we commonly think of ρ as an 'alternative' to Pearson's R, something quite different is actually being measured. For our purposes however, since it does provide a measure of the strength of association between two ranked variables, the characterisation is acceptable.
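The difference between a linear and a monotonic relationship can be sketched in a few lines of Python. The example below ranks each variable and applies the rank-difference formula for ρ, which is valid only when there are no tied values. The helper names and the data (a cubic curve) are assumptions chosen for illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rho from squared rank differences (no ties)."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data: y rises with x along a curve, not a straight line
x = [1, 2, 3, 4, 5, 6]
y_curve = [v ** 3 for v in x]        # always increasing
y_reversed = [-(v ** 3) for v in x]  # always decreasing

rho_up = spearman_rho(x, y_curve)       # 1.0
rho_down = spearman_rho(x, y_reversed)  # -1.0
```

Because only the ordering of the values matters, the curve scores ρ=1 even though no straight line fits it exactly.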
Spearman's ρ is not the only non-parametric measure of correlation. We also see Kendall's τ, which likewise indicates the strength of association but does so by comparing the probability that the two series (x and y) are in the same order with the probability that they are ordered differently.
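One way to see what Kendall's τ counts is to compare every pair of observations. The sketch below computes the simplest variant (often called tau-a), which ignores ties: a pair is concordant if x and y move in the same direction and discordant if they move in opposite directions. The function name and the data are illustrative assumptions.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables order this pair the same way
        elif direction < 0:
            discordant += 1   # the variables order this pair differently
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical rankings: mostly in the same order, with two swaps
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
tau = kendall_tau(x, y)  # 8 concordant, 2 discordant pairs -> 0.6
```

Statistics packages usually report a tie-corrected variant, so their figure may differ from this sketch when the data contain ties.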
Inferences on the Correlation Coefficient
The null hypothesis concerning correlation is that the correlation coefficient for the population is 0. The p value of the coefficient, calculated by your software from the sample data, indicates whether you should reject the null hypothesis. If the confidence level is set to 95%, the null hypothesis is rejected when p<0.05.
Chi-square: cross tabulation revisited
We saw previously how to crosstabulate nominal variables and count frequencies. The data looked like this:
|Crosstabulation eye colour x gender|
We want to know if there is any association between eye colour and gender. The null hypothesis is that there is no association: each eye colour is as likely to be observed in either gender. To test this we will compute the statistic Pearson's Chi-square and then check its significance for the correct degrees of freedom. Of course, our statistics package will do all the hard work for us; we simply have to interpret the results. How is Chi-square calculated?
We will first add the expected counts to the cells of our table. The expected count for a cell is calculated as (row total × column total) / grand total, using the totals of the row and the column in which the cell sits.
|Crosstabulation eye colour x gender|
These expected counts are the counts we would expect to see in each cell if the two variables were independent, given the row and column totals.
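As a minimal sketch of this step, the Python below derives expected counts from the row totals, column totals and grand total of a table. The function name and the 2 x 2 table of counts are hypothetical; they are not the eye colour data from this chapter.

```python
def expected_counts(observed):
    """Expected cell counts under independence for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Each cell: (its row total * its column total) / grand total
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical counts: rows are categories of one variable,
# columns are categories of the other
observed = [
    [10, 20],
    [30, 40],
]
expected = expected_counts(observed)  # [[12.0, 18.0], [28.0, 42.0]]
```

Note that the expected counts keep the same row and column totals as the observed counts; only the distribution within the table changes.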
Now we can ask, are the observed counts close to what we would expect or distant from it? This is the essential question behind testing for association between two categorical variables.
- The statistic that is computed is Pearson's Chi-square.
- The null hypothesis is that there is no association between the two variables
- The degrees of freedom are calculated as (number of rows - 1)*(number of columns - 1)
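You don't need to know the formula that calculates Chi-square; you should rely on your software to calculate it. However, so that you have seen it, here it is:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where O is the observed count and E the expected count in a cell, and the sum runs over all cells of the table.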
Notice that this statistic again involves calculating deviations - this time the difference between the observed and expected counts. As in other cases the deviations are squared, this number is then divided by the expected count on a cell by cell basis and the result summed.
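The calculation just described, together with the degrees of freedom, can be sketched as follows. The function name and the table of counts are hypothetical illustrations; the figures quoted for the eye colour data come from the author's software, not from this code.

```python
def chi_square(observed):
    """Return (Pearson's Chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            # Squared deviation, scaled by the expected count for the cell
            statistic += (count - expected) ** 2 / expected

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Hypothetical 2 x 2 table of observed counts
observed = [
    [10, 20],
    [30, 40],
]
statistic, df = chi_square(observed)
```

A table with 5 rows and 2 columns passed to the same function would give (5-1)*(2-1) = 4 degrees of freedom.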
Our software (I used SPSS) provides the figures and we can now interpret the results:
- We will set a confidence level of 95% (so we require p<0.05 to reject the null hypothesis).
- Our data produce a statistic of 0.650, with degrees of freedom (5-1)*(2-1) = 4.
- The value of p is 0.957, p>0.05 .
On these data we cannot reject the null hypothesis: we have no evidence that the variables are associated, so we treat them as independent of each other.
- I ignore Pearson's Contingency Coefficient here, which is derived from the Chi-square statistic
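If you are curious where the p value comes from, it is the upper-tail probability of the Chi-square distribution at the observed statistic. For an even number of degrees of freedom this has a simple closed form, which the sketch below uses; the function name is a hypothetical one, and for odd degrees of freedom you would need your statistics package or a library routine instead.

```python
import math

def chi_square_p_value_even_df(statistic, df):
    """Upper-tail p value of a Chi-square statistic; even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for even, positive df")
    half = statistic / 2
    # exp(-x/2) * sum over k = 0 .. df/2 - 1 of (x/2)^k / k!
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms

# Figures reported for the eye colour x gender table
p = chi_square_p_value_even_df(0.650, 4)
reject_null = p < 0.05  # False: p is about 0.957
```

The result agrees with the value of p reported by the software to three decimal places.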
- Cramer's V is suitable for ordinal data as well as nominal