Why, and How, Should Geologists Use Compositional Data Analysis/Dealing With Zero Values
In exploration geology there is hardly a dataset that does not include below-detection-limit and/or zero values. Since most geological data follow lognormal distributions, these “zero data” pose a mathematical challenge for processing.
The method that I am proposing takes into consideration the well-known relationships between some elements. For example, in porphyry copper deposits there is always a significant direct correlation between the copper and molybdenum values. However, while copper will always be above the limit of detection, many of the molybdenum values will be “rounded zeros”. In such a case, I take the lower quartile of the real molybdenum values, establish a regression equation with copper, and then estimate the “rounded zero” values of molybdenum from their corresponding copper values.
One can apply this method to any type of data, provided the correlation between the variables is established first.
One of the main advantages of this method is that we do not obtain a fixed value for the “rounded zeros”, but one that depends on the value of the other variable.
Are there any zeros in the house?
We need to start by recognizing that there are true zero values in geology. For example, the amount of quartz in a nepheline syenite is zero, since quartz cannot coexist with nepheline (Trusova and Chernov, 1982). In binomial distributions, such as the drilling of an ore body, you will either intersect the ore body (1) or not (0). Another common zero is a north azimuth, although we can always replace that zero with 360°. These are the “Essential Zeros” (Aitchison, 2003) or “Real zeros”. They are not a problem as long as their population does not follow a lognormal distribution, since you cannot take the logarithm of zero.
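The logarithm problem can be made concrete with a minimal Python sketch of a centred log-ratio (CLR) function. The function name `clr` and the two three-part compositions are hypothetical illustrations, not data from this study; the sketch only assumes the usual definition of CLR as the logarithm of each part minus the mean of the logarithms.

```python
import math

def clr(parts):
    """Centred log-ratio of a composition whose parts are all strictly positive."""
    if any(p <= 0 for p in parts):
        # log(0) is undefined, so a single zero part blocks the whole transformation
        raise ValueError("clr is undefined when any part is zero or negative")
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical compositions (not taken from Table 28)
positive_sample = [0.2, 0.7, 0.1]
zero_sample = [0.3, 0.7, 0.0]

coords = clr(positive_sample)      # works: every part is positive

try:
    clr(zero_sample)               # fails: one part is zero
    zero_failed = False
except ValueError:
    zero_failed = True
```

The CLR coordinates of the positive sample add up to zero, while the sample containing a zero cannot be transformed at all, which is why zeros have to be dealt with before any log-ratio analysis.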
Then in geology, especially in geochemistry, we also have “Rounded Zeros”. In some cases, laboratories report values below the detection limit (b.d.l.) as zeros or as non-existent, while in most cases they simply report the detection limit as the value for that parameter. These b.d.l. values pose a problem similar to that of the “Rounded Zeros”. Let us illustrate with the example in Table 28.
Table 28. CLR transformed data for the example used. The original b.d.l. value was 0.5 ppm Mo.
Au | Cu | Mo | Au | Cu | Mo
0.23342 | 0.72138 | 0.0452 | 0.22814 | 0.19547 | 0.57639
0.32663 | 0.61146 | 0.06191 | 0.47232 | 0.48404 | 0.04363
0.12652 | 0.16198 | 0.71149 | 0.21663 | 0.27648 | 0.50689
0.20133 | 0.28139 | 0.51728 | 0.207 | 0.24005 | 0.55295
0.20796 | 0.3302 | 0.46184 | 0.30235 | 0.24198 | 0.45567
0.41506 | 0.51552 | 0.06942 | 0.20662 | 0.12838 | 0.665
0.20034 | 0.21824 | 0.58142 | 0.25618 | 0.3309 | 0.41292
0.13951 | 0.18003 | 0.68046 | 0.26629 | 0.29193 | 0.44178
0.14029 | 0.12599 | 0.73372 | 0.50983 | 0.45371 | 0.03645
0.12876 | 0.17744 | 0.6938 | 0.26377 | 0.25831 | 0.47792
0.13442 | 0.28513 | 0.58045 | 0.50258 | 0.46207 | 0.03536
0.22589 | 0.15577 | 0.61834 | 0.61358 | 0.3443 | 0.04212
0.18861 | 0.13306 | 0.67834 | 0.34444 | 0.32052 | 0.33504
0.26028 | 0.28088 | 0.45884 | 0.38402 | 0.1694 | 0.44658
0.19831 | 0.31928 | 0.48241 | 0.24623 | 0.25965 | 0.49412
0.37797 | 0.58109 | 0.04094 | 0.26424 | 0.17672 | 0.55904
0.47982 | 0.46952 | 0.05065 | 0.42291 | 0.14872 | 0.42837
0.17888 | 0.314 | 0.50712 | 0.2371 | 0.18903 | 0.57387
0.26791 | 0.17539 | 0.5567 | 0.27013 | 0.17273 | 0.55714
0.64782 | 0.3135 | 0.03868 | 0.51564 | 0.45322 | 0.03113
The ternary diagram in Fig. 51 shows these results.
Figure 51. Ternary diagram of the studied data. On the left, the CLR transformed data; on the right, the same data after being centered using CoDaPack software.
It is clear that even centering does not solve the “problem” of the b.d.l. data, which remain grouped along the AB axis.
Zero, zero… What shall I do with you?
Geologists, even those not familiar with compositional data analysis, have been dealing with this problem for quite some time (Kashdan et al., 1979). One of the most frequently used techniques is amalgamation (Aitchison, 1986). Amalgamation, e.g. adding Na₂O and K₂O as total alkalis, is a solution, but sometimes we need to differentiate between sodic and potassic alteration, and then amalgamation is not an option.
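As a minimal illustration of amalgamation, the Python sketch below merges two oxides into a single “total alkalis” part. The helper name `amalgamate` and the weight-percent values are hypothetical and serve only to show the mechanics.

```python
def amalgamate(parts, members, new_name):
    """Replace the parts listed in `members` by a single part holding their sum."""
    merged = {name: value for name, value in parts.items() if name not in members}
    merged[new_name] = sum(parts[name] for name in members)
    return merged

# Hypothetical whole-rock analysis in wt% (not data from this study);
# Na2O was reported as zero
sample = {"SiO2": 60.0, "Al2O3": 17.0, "CaO": 3.0, "Na2O": 0.0, "K2O": 5.0}

alkalis = amalgamate(sample, ["Na2O", "K2O"], "Na2O+K2O")
```

The zero disappears because it is absorbed into the sum, but so does the distinction between the sodic and the potassic component, which is the limitation described above.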
Pre-classification into groups is another solution, but it requires a good knowledge of the distribution of the data and of the geochemical characteristics of the groups, which is not always available.
Setting the zero values equal to the limit of detection of the equipment used, or substituting them with some other constant (e.g. half the limit of detection), will generate spurious distributions, especially in ternary diagrams, as shown in Fig. 51.
The same situation will occur if we replace the zero values by a very small amount (Bacon-Shone, 2003) using non-parametric or parametric techniques (imputation). Even if we add the same small value to all of the analyzed parameters, we will get the same spurious distribution.
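The effect of a constant substitution can be sketched in a few lines of Python. The detection limit of 0.5 ppm Mo is the one quoted for Table 28; the function name `substitute_constant` and the list of Mo values are hypothetical.

```python
DETECTION_LIMIT = 0.5  # ppm Mo, as quoted for Table 28

def substitute_constant(values, limit, fraction=0.5):
    """Replace every value below the detection limit by a fixed fraction of that limit."""
    return [limit * fraction if v < limit else v for v in values]

# Hypothetical Mo assays in ppm (not data from this study); 0.0 stands for b.d.l.
mo_ppm = [0.0, 12.0, 0.0, 3.4, 0.0, 8.1]

replaced = substitute_constant(mo_ppm, DETECTION_LIMIT)

# Collect the distinct values that were assigned to the b.d.l. samples
imputed = {r for v, r in zip(mo_ppm, replaced) if v < DETECTION_LIMIT}
```

Every b.d.l. sample receives exactly the same number, whatever the rest of its composition looks like, so these samples inevitably plot together in a ternary diagram.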
How do I deal with spurious distributions?
The method that I am proposing takes into consideration the existence of well-known relationships between some elements. For example, in porphyry copper deposits there is always a clear dependency between the copper and molybdenum values (Fig. 52). However, while copper will always be above the limit of detection, many of the molybdenum values will be b.d.l. (“Rounded Zeros”).
Figure 52. As in all porphyry copper deposits, there is a strong correlation between Cu and Mo values. We can use this correlation to estimate the b.d.l. values for Mo.
In this case, I take the lower quartile of the real molybdenum values (Table 29) and establish a regression equation with copper (Table 30). The “Rounded Zero” values of molybdenum are then replaced by the values estimated from their corresponding copper values.
Table 29. Lower quartile of the real Mo data from this study.
Au | Cu | Mo |
0.41506 | 0.51552 | 0.06942 |
0.34444 | 0.32052 | 0.33504 |
0.25618 | 0.3309 | 0.41292 |
0.42291 | 0.14872 | 0.42837 |
0.26629 | 0.29193 | 0.44178 |
0.38402 | 0.1694 | 0.44658 |
0.30235 | 0.24198 | 0.45567 |
0.26028 | 0.28088 | 0.45884 |
0.20796 | 0.3302 | 0.46184 |
0.26377 | 0.25831 | 0.47792 |
Table 30. Results of the regression analysis for the lower quartile of real molybdenum data from the studied case.
Regression Statistics |
Multiple R | 0.792759158
R Square | 0.628467082
Standard Error | 0.079131742
Observations | 10
ANOVA |
Source | df | SS | MS | F | Significance F
Regression | 1 | 0.084737701 | 0.084737701 | 13.53241 | 0.006231332
Residual | 8 | 0.050094661 | 0.006261833 | |
Total | 9 | 0.134832361 | | |
Term | Coefficients | Std. Error | t Stat | P-value
Intercept | 0.674594109 | 0.079027652 | 8.536178017 | 2.73E-05
X Variable 1 (Cu) | -0.954714886 | 0.259529113 | -3.678642738 | 0.006231
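The regression of Table 30 can be checked with a short Python sketch that fits an ordinary least-squares line to the ten Cu and Mo pairs of Table 29, with Cu as the independent variable. The function name `ols` is a hypothetical helper; the data are copied from Table 29.

```python
# Cu and Mo columns of Table 29, in the same row order
cu = [0.51552, 0.32052, 0.3309, 0.14872, 0.29193,
      0.1694, 0.24198, 0.28088, 0.3302, 0.25831]
mo = [0.06942, 0.33504, 0.41292, 0.42837, 0.44178,
      0.44658, 0.45567, 0.45884, 0.46184, 0.47792]

def ols(x, y):
    """Simple least-squares fit y = intercept + slope * x; also returns R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

intercept, slope, r_squared = ols(cu, mo)
print(f"intercept={intercept:.6f}  slope={slope:.6f}  R2={r_squared:.6f}")
```

The printed intercept, slope and R squared should agree with the values in Table 30 to within rounding, which confirms that “X Variable 1” in that table is Cu and that Mo is the dependent variable.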
Therefore, according to Table 30, we can use regression equation (34) to estimate the b.d.l. values for Mo.
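Equation 34. Regression equation for Mo, written with the coefficients of Table 30 (X Variable 1 is Cu): $\mathrm{Mo} = 0.6746 - 0.9547\,\mathrm{Cu}$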
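A minimal Python sketch of how the regression could be applied is given below. It uses the intercept and slope from Table 30 and the Cu values of three rows of Table 28 whose Mo values are very low; treating those three rows as b.d.l. samples is an assumption made here for illustration, and the function name `estimate_mo` is hypothetical. The sketch stops at the raw estimate: it does not re-close the composition and does not guard against estimates that reach zero or below at very high Cu values.

```python
# Coefficients taken from Table 30
INTERCEPT = 0.674594109
SLOPE = -0.954714886

def estimate_mo(cu_value):
    """Estimate Mo from Cu with the linear regression of Table 30."""
    return INTERCEPT + SLOPE * cu_value

# Cu values of three low-Mo rows of Table 28 (assumed here to be b.d.l. samples)
cu_bdl = [0.61146, 0.48404, 0.58109]

mo_estimates = [estimate_mo(c) for c in cu_bdl]
for c, m in zip(cu_bdl, mo_estimates):
    print(f"Cu = {c:.5f} -> estimated Mo = {m:.5f}")
```

Unlike a constant substitution, each sample receives its own Mo estimate, determined by its Cu value.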
Fig. 53 shows that the results obtained are close to the predicted line.
So, did we get rid of the spurious effect?
As Fig. 54 clearly shows, only one value of Mo was really close to zero, while the rest now have values that reflect the geochemical characteristics of the data.
Figure 54. The ternary diagram on the left clearly shows that only one sample has a low value of Mo. The ternary diagram on the right compares the original data (red) with the new estimated values of Mo (green).