Statistics Ground Zero/Association


Association

We use the term association to indicate that two variables are not independent of each other. So, we expect that a change in value in one of the variables will be associated with a change in the other. We do not assume that the relationship between two variables reflects causation. This last point has become something of a cliché, and it is worth pointing out that while correlation does not prove causation, it can be a pretty strong hint.

In a sense, association is one of the most basic and common statistical observations. In hypothesis testing we look at the association between two variables: one treated as the independent variable (which we are free to modify), the other as the dependent variable (which we observe for a change in value).

So, for example, in the case of scalar variables the Pearson correlation coefficient is the common measure of association, while for categorical data we can test for association using Pearson's Chi-square[1] and indicate its strength using Cramer's V[2] or, for binary variables, the Phi Coefficient.

Correlation

Correlation measures the strength of association between two variables. We will first consider the relationship between two scalar variables and then between ranked variables.

Pearson correlation coefficient

Pearson's R indicates the strength and direction of association between two scalar variables, ranging from -1, which indicates a perfect inverse relationship, to 1, which indicates a perfect direct relationship. At 0 we say there is no correlation. Pearson's R measures the linear dependence of one variable on another. Linear dependence is the degree to which one variable can be computed from the other by a linear equation (which will be explained in slightly more detail under regression below).

The significance (p value) of R is the probability of obtaining a coefficient at least as large as the one observed if the null hypothesis R=0 is true. The statistic used to test this is distributed approximately as t.
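For reference, with a sample of n pairs the test statistic is

t = \frac{R \sqrt{n - 2}}{\sqrt{1 - R^{2}}}

which is compared against the t distribution with n - 2 degrees of freedom.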

Consider the following measurements of baby height against age:

Age (months)   Height (cm)
0              53.0
3              59.5
6              66.0
9              71.5
12             76.0
18             85.0
24             90.0

This data can be visualised in this graph

[Scatter plot: height (cm) against age (months)]

where we can see that there is a strong direct relationship between age in months and height in centimetres. Indeed, for this data the correlation coefficient is 0.99 (rounded to two decimal places).
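If you want to check this figure yourself, here is a minimal sketch in Python (assuming the SciPy library is available; any statistics package will report the same values):

from scipy.stats import pearsonr

age = [0, 3, 6, 9, 12, 18, 24]                        # age in months
height = [53.0, 59.5, 66.0, 71.5, 76.0, 85.0, 90.0]   # height in cm

r, p = pearsonr(age, height)   # correlation coefficient and its p value
print(round(r, 2))             # 0.99
print(p < 0.05)                # True: we reject the null hypothesis R = 0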

An inverse correlation

In the following data we see that as the number of widgets rises, the price per 100 widgets falls. This is an inverse correlation and has a negative value for Pearson's R.

No. of widgets   Price per 100
1000             60
800              70
600              80
400              90
200              100

This is visualised in the following graph

[Scatter plot: price per 100 against number of widgets]

For this data the correlation coefficient has a value of -1.
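Checking this is equally quick; here the coefficient is exactly -1 because every point lies on the straight line price = 110 - 0.05 × widgets. A minimal sketch, this time assuming NumPy is available:

import numpy as np

widgets = [1000, 800, 600, 400, 200]
price = [60, 70, 80, 90, 100]              # price per 100 widgets

print(np.corrcoef(widgets, price)[0, 1])   # -1.0 (up to floating point rounding)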

Spearman correlation coefficient (ranked data)

The Spearman Correlation Coefficient (ρ, usually pronounced like the English word 'row' - boats, not arguments) is the counterpart to Pearson's R for ordinal data. The coefficient expresses the degree of association between two ordinal variables. Two variables with a positive relationship, in this case with ρ=1, produce the following plot:

[Graph: a monotonic function - X and Y have a Spearman correlation coefficient of 1]

Above we read that Pearson's R measures the linear dependence of x on y, that is, the degree to which a linear function represents the relationship. Spearman's ρ can be interpreted as the extent to which a monotonic function represents the relationship between x and y. For our current purposes, a monotonic function is one that never changes direction on the y axis - that is, there are no bumps or dips in the graph.

We see that, rather than a straight line, this produces a curve which, while it may 'plateau', never reverses its direction of movement on the y axis. A score of -1 would produce a mirror image of this curve in the y axis.

Although we commonly think of ρ as an 'alternative' to Pearson's R, something quite different is actually being measured. For our purposes however, since it does provide a measure of the strength of association between two ranked variables, the characterisation is acceptable.

Spearman's ρ is not the only non-parametric measure of correlation; we also see Kendall's τ, which, although it also indicates the strength of association, does so by expressing the probability that the two series (x and y) are in the same order against the probability that they are ranked differently.
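To see the difference in practice, the sketch below (again assuming SciPy is available) applies both rank statistics to the age and height data from earlier. Pearson's R was 0.99 because growth is not quite linear, but because height never decreases with age the ranks agree perfectly, so Spearman's ρ and Kendall's τ both come out as exactly 1:

from scipy.stats import spearmanr, kendalltau

age = [0, 3, 6, 9, 12, 18, 24]
height = [53.0, 59.5, 66.0, 71.5, 76.0, 85.0, 90.0]

rho, p_rho = spearmanr(age, height)    # rank correlation and its p value
tau, p_tau = kendalltau(age, height)
print(rho, tau)                        # 1.0 1.0 - the relationship is perfectly monotonic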

Inferences on the Correlation Coefficient

The null hypothesis concerning correlation is that the correlation coefficient for the population is 0. The p value of the coefficient, based on the sample data and calculated by your software, indicates whether you should reject the null hypothesis. If the confidence level is set to 95%, then the null hypothesis is rejected when p < 0.05.

Chi-square: cross tabulation revisited

We saw previously how to cross tabulate nominal variables and count frequencies. The data looked like this:

Crosstabulation eye colour x gender

eye colour   gender: f   gender: m
blue         6           6
brown        12          12
green        7           7
grey         4           6
other        9           7
Total        38          38

We want to know if there is any association between eye colour and gender. The null hypothesis is that there is no association: each eye colour is as likely to be observed in either gender. To test this we will compute the statistic Pearson's Chi-square and then check its significance for the correct degrees of freedom. Of course, our statistics package will do all the hard work for us; we simply have to interpret the results. How is Chi-square calculated?

We will first add the expected counts to the cells of our table. The expected count for a cell is the row total multiplied by the column total, divided by the grand total. In symbols:
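E_{ij} = \frac{R_i \times C_j}{N}

where R_i is the total of row i, C_j the total of column j and N the total number of observations. For example, the expected count for blue eyes among females is (12 × 38) / 76 = 6.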

Crosstabulation eye colour x gender, with expected counts

eye colour             gender: f   gender: m
blue       observed    6           6
           expected    6           6
brown      observed    12          12
           expected    12          12
green      observed    7           7
           expected    7           7
grey       observed    4           6
           expected    5           5
other      observed    9           7
           expected    8           8
Total      observed    38          38
           expected    38          38

These expected counts are the frequencies we would expect to see in each cell if the two variables were independent - that is, the probability of an observation falling in a particular cell under the null hypothesis, multiplied by the total number of observations.

Now we can ask, are the observed counts close to what we would expect or distant from it? This is the essential question behind testing for association between two categorical variables.

  • The statistic that is computed is Pearson's Chi-square.
  • The null hypothesis is that there is no association between the two variables
  • The degrees of freedom are calculated as (number of rows - 1)*(number of columns - 1)

You don't need to know the formula that calculates Chi-square; you can rely on your software to calculate it. However, so that you have seen it, here it is:
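\chi^2 = \sum \frac{(O - E)^2}{E}

where O is the observed count and E the expected count in a cell, and the sum runs over every cell of the table.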

Notice that this statistic again involves calculating deviations - this time the difference between the observed and expected counts. As in other cases the deviations are squared; each squared deviation is then divided by the expected count on a cell-by-cell basis and the results summed.

Our software (I used SPSS) provides the figures and we can now interpret the results:

  • We will set a confidence level of 95% (so we require p < 0.05 to reject the null hypothesis).
  • Our data produce a statistic of 0.650, with degrees of freedom (5 - 1) × (2 - 1) = 4.
  • The value of p is 0.957, so p > 0.05.

On these data we cannot reject the null hypothesis: we have no evidence that the variables are associated, and we treat them as independent of each other.
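If you are not using SPSS, the same figures can be reproduced with a short sketch in Python (assuming SciPy is available):

from scipy.stats import chi2_contingency

# observed counts: rows are eye colours, columns are gender (f, m)
observed = [[6, 6],     # blue
            [12, 12],   # brown
            [7, 7],     # green
            [4, 6],     # grey
            [9, 7]]     # other

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 3), dof, round(p, 3))   # 0.65 4 0.957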

Contents

1 Introduction
2 Statistical Measure
3 Parametric and Non-parametric Methods
4 Descriptive Statistics
5 Inferential Statistics: hypothesis testing
6 Degrees of freedom
7 Significance
8 Association
9 Comparing groups or variables
10 Regression

Notes

  1. I ignore Pearson's Contingency Coefficient here, which is derived from the Chi-square statistic.
  2. Cramer's V is suitable for ordinal data as well as nominal.