Handbook of Descriptive Statistics/Measures of Statistical Variability/Variance
- Brief description of what it is and what it's used for.
- Mathematical formula
- Link to a Wikipedia article on it.
Describe in detail here the data sets and purposes for which it is most useful. Also describe data sets for which it may give misleading results.
- Include standard values for common distributions, if they exist. For example, a normal distribution always has a skewness of zero and an excess kurtosis of zero.
- Include standard error for normal distribution, and for other distributions also if possible.
- Sampling distribution.
A formula for calculating the variance σ² of a population of size N is:

σ² = (1/N) Σ (xᵢ − μ)²   (sum over i = 1, …, N)

where μ stands for the population mean.
A formula for calculating the sample variance s², the unbiased estimate of the population variance, from a sample of size n, is:

s² = (1/(n − 1)) Σ (xᵢ − x̄)²   (sum over i = 1, …, n)

where x̄ stands for the sample mean.
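As a sketch of the two formulas above in plain Python (no external libraries; the function names and the small data set are illustrative, not part of the handbook):

```python
def population_variance(data):
    """sigma^2 = (1/N) * sum((x - mu)^2): divide by N, using the population mean."""
    n = len(data)
    mu = sum(data) / n
    return sum((x - mu) ** 2 for x in data) / n

def sample_variance(data):
    """s^2 = (1/(n-1)) * sum((x - xbar)^2): divide by n - 1 for the unbiased estimate."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(population_variance(data))  # 4.0
print(sample_variance(data))      # 32/7, about 4.571
```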
Therefore a simple algorithm to calculate the variance can be described by the following pseudocode:

double sum = 0;
double sum_sqr = 0;
double variance;
long n = data.length; // the number of elements in the data array (the actual syntax is language-specific)
for i = 0 to n - 1
    sum += data[i];
    sum_sqr += data[i] * data[i];
end for
variance = ((n * sum_sqr) - (sum * sum)) / (n * (n - 1));
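A direct Python transcription of this one-pass algorithm might look like the following (the function name is illustrative):

```python
def naive_variance(data):
    """One-pass sample variance via the identity
    s^2 = (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(data)
    total = 0.0      # running sum of the values
    total_sqr = 0.0  # running sum of the squared values
    for x in data:
        total += x
        total_sqr += x * x
    return (n * total_sqr - total * total) / (n * (n - 1))

print(naive_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # 32/7, about 4.571
```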
Another algorithm avoids accumulating the large total sum_sqr by updating the mean and the sum of squared deviations incrementally as each value arrives (Welford's method):

double avg = 0;  // running mean
double m2 = 0;   // running sum of squared deviations from the mean
double delta;
long n = data.length; // number of elements
for i = 0 to n - 1
    delta = data[i] - avg;         // deviation from the old mean
    avg += delta / (i + 1);        // update the running mean
    m2 += delta * (data[i] - avg); // uses both the old and the new mean
end for
return m2 / (n - 1); // resulting sample variance
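A runnable version of this incremental scheme, in Welford's formulation (a Python sketch; the variable names are illustrative):

```python
def welford_variance(data):
    """Single-pass sample variance without a large sum of squares.
    Maintains a running mean and a running sum of squared deviations (m2)."""
    avg = 0.0
    m2 = 0.0
    for i, x in enumerate(data):
        delta = x - avg           # deviation from the old mean
        avg += delta / (i + 1)    # update the running mean
        m2 += delta * (x - avg)   # product of old- and new-mean deviations
    return m2 / (len(data) - 1)   # sample variance

print(welford_variance([1.0, 2.0, 3.0]))  # 1.0
```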
The 'corrected two-pass algorithm' of Chan, Golub, and LeVeque (The American Statistician, 1983, p. 242):

avg = 0
for all i
    avg = avg + data[i]
end for
avg = avg / n
sum = 0
sum2 = 0
for all i
    sum2 = sum2 + (data[i] - avg)^2
    sum = sum + (data[i] - avg)
end for
var = (sum2 - (sum^2) / n) / (n - 1)
return var
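A Python sketch of the corrected two-pass algorithm (function and variable names are illustrative):

```python
def two_pass_variance(data):
    """Corrected two-pass sample variance (Chan, Golub, LeVeque, 1983).
    The (sum of deviations)^2 / n term compensates for rounding error in avg."""
    n = len(data)
    avg = sum(data) / n
    sum_dev = 0.0   # sum of deviations; ~0, absorbs rounding error in avg
    sum_dev2 = 0.0  # sum of squared deviations
    for x in data:
        sum_dev += x - avg
        sum_dev2 += (x - avg) ** 2
    return (sum_dev2 - sum_dev ** 2 / n) / (n - 1)

print(two_pass_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # 32/7, about 4.571
```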
Analysis of Algorithms
When dealing with large data sets or large numbers on a computer, it is important to choose an algorithm that minimizes the error introduced by the computer's finite-precision arithmetic. Merge in the discussion of the above algorithms below.
It is actually the first formula that has precision problems when dealing with limited-precision arithmetic. If the differences between the measurements and the mean are very small, then the first formula loses precision, as information is lost in the (xᵢ − µ) subtraction. There is no such loss of significance in the intermediate operations of the second formula.
In fact, the second formula is the one more commonly beset with problems. The first can have problems when the mean is very large relative to the variance, but this is relatively rare in practice and this problem also affects the second formula. Much more common is the situation where you have comparable mean and variance and a very large number of observations. In this case, the second formula will result in the subtraction of two very large numbers whose difference is relatively small (by a factor roughly equal to the number of observations). If you have one million observations, you lose roughly six significant figures with the second formula if you use ordinary floating point arithmetic.
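The loss is easy to demonstrate (an illustrative sketch; the 10⁹ offset and the data set are arbitrary choices, not from any particular reference). Shifting a small data set by a large constant leaves its variance unchanged, yet the one-pass sum-of-squares formula loses nearly all of its significant figures:

```python
def naive_variance(data):
    # one-pass formula: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))
    n = len(data)
    s = sum(data)
    s2 = sum(x * x for x in data)
    return (n * s2 - s * s) / (n * (n - 1))

def two_pass_variance(data):
    # definitional formula: sum((x - mean)^2) / (n - 1)
    n = len(data)
    avg = sum(data) / n
    return sum((x - avg) ** 2 for x in data) / (n - 1)

small = [4.0, 7.0, 13.0, 16.0]       # sample variance is exactly 30
shifted = [x + 1e9 for x in small]   # same spread, huge mean

print(naive_variance(small), two_pass_variance(small))      # both 30.0
print(naive_variance(shifted), two_pass_variance(shifted))  # naive is badly off
```

With the shifted data, the two huge quantities n·Σx² and (Σx)² agree in their leading digits, so their difference in 64-bit floating point retains almost no correct figures; the naive result can even come out negative.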
The problem may occur
- when the deviations are very small relative to the mean or
- when they are small relative to the representational capacity of the arithmetic instrument (floating point computer, fixed point calculator, paper and pencil).
To be precise, we have to specify the instrument and the nature of the data.
What do you mean by ordinary floating point arithmetic? Are you talking about precision with respect to register width, or is there an implicit reference to some "non-ordinary" scheme, that is, non-IEEE? Thanks.

I thought the reference was pretty explicit. I didn't mean, however, to imply the use of non-standard but still floating point arithmetic. I had in mind systems that are entirely different. A simple example is a system that uses rationals based on unlimited-precision integers. Another might be fixed point arithmetic. The first would lose no precision in the computation, while the second would lose more or less precision than I quote, depending on the values in question and the characteristics of the fixed point system. You could also imagine an optimizer that noticed you were adding up a large number of very small values and re-ordered the addition to avoid most of the round-off error, using log N temporary values to perform the addition.
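The unlimited-precision-rationals idea mentioned above can be sketched with Python's fractions module (illustrative only; any exact-rational system would behave the same way):

```python
from fractions import Fraction

def exact_sample_variance(data):
    """Sample variance computed over exact rationals: no rounding at any step."""
    xs = [Fraction(x) for x in data]
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

# The large offset that breaks naive floating-point summation is harmless here.
data = [10**9 + v for v in (4, 7, 13, 16)]
print(exact_sample_variance(data))  # 30
```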
Include how it is accessed in common statistical packages, if known. Can also provide links to source code samples for calculating it, etc.
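For example, Python's standard library statistics module (one common option; the function names are as documented in the standard library) exposes both estimators directly:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(statistics.pvariance(data))  # population variance (divide by N): 4.0
print(statistics.variance(data))   # sample variance (divide by n - 1): about 4.571
```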