Statistics Ground Zero/Descriptive Statistics

Descriptive Statistics

Descriptive statistics summarize the quantitative character of your data set. They describe data so as to illustrate how some phenomenon appears in the cases observed. Descriptive statistics answer questions like "What proportion of the cases have blue eyes?" or "What is the typical household income of the cases observed?" In computing descriptive statistics, we do not intend to make any inference from the data collected to any population larger than or outside of the cases observed.

Population and Sample

Where I give a calculation in Descriptive Statistics I will give the formula for the population parameter unless otherwise specified. The population parameter is used when data have been collected for all the cases under investigation and all of them are used in the calculation. If the data set is a sample (perhaps collected from some representative number of cases) then the sample statistic is used instead. The parameter and the statistic differ in their divisor. The mean is the sum of the scores divided by the number of cases (N) in both situations, but measures built on deviations from the mean, such as the variance and the standard deviation, divide by N for the population parameter and by the number of cases minus one (N-1) for the sample statistic. Dividing by N-1 compensates for the tendency of a sample to understate the spread of the population it was drawn from. For a small number of cases subtracting one from N has a very large effect; for a large number of cases subtracting one from N has a much smaller effect, so the difference between the sample statistic and the population parameter shrinks as N increases.
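As a rough illustration of the N versus N-1 divisor, here is a minimal Python sketch. It assumes the standard library `statistics` module, in which `pvariance` divides by N and `variance` divides by N-1. The scores are hypothetical and serve only to show the two divisors side by side.

```python
import statistics

# Hypothetical scores, used only to illustrate the two divisors.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Population parameter: sum of squared deviations divided by N.
population_variance = statistics.pvariance(scores)

# Sample statistic: the same sum of squares divided by N - 1.
sample_variance = statistics.variance(scores)

print(population_variance, sample_variance)

# The ratio between the two results is N / (N - 1), which approaches 1
# as the number of cases grows.
for n in (5, 50, 500):
    print(n, n / (n - 1))
```

The loop at the end shows why the choice of divisor matters most for small data sets: the ratio between the two results is 1.25 for five cases but very close to 1 for five hundred.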

Frequency

Frequencies can be computed for nominal and ordinal variables and for continuous variables, though with a slightly different meaning. For a discrete (that is, nominal or ordinal) variable, the frequency is the count of instances of each level in the data. For a continuous variable, it is common to bin the values observed into groups of a particular width (for example, one bin might contain scores from 0 up to 5, the next scores above 5 up to 10, and so on).

Dealing with a single variable (univariate)

If we imagine a class of school students and test scores, we know that a number of students might score 50/100, another group 65/100 and so on. The number scoring at each level is the frequency of that score. If we record these frequencies we have the frequency distribution of that variable. We can tabulate frequencies as counts, percentages and cumulative percentages of the data.

The following table tabulates age data. For each age to the nearest year encountered in the data, the frequency is counted and the absolute figure recorded, and percentages calculated.

AGE Frequency Percent Valid Percent Cumulative Percent
10.00 5 17.9 17.9 0+17.9=17.9
11.00 10 35.7 35.7 17.9+35.7=53.6
12.00 10 35.7 35.7 53.6+35.7=89.3
13.00 3 10.7 10.7 89.3+10.7=100.0
Total 28 100.0 100.0

In this example, the valid percent column is identical to the percent column because there are no missing data, that is cases for which the age is unknown.
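The tabulation of counts, percentages and cumulative percentages can be sketched in a few lines of Python. The function name `frequency_table` is my own choice for illustration, and the list of ages is rebuilt from the counts in the table above. The cumulative percentage is computed here from the running count rather than by adding rounded percentages, which gives the same figures to one decimal place.

```python
from collections import Counter

# Rebuild the 28 ages from the frequency table above.
ages = [10] * 5 + [11] * 10 + [12] * 10 + [13] * 3


def frequency_table(values):
    """Return rows of (value, count, percent, cumulative percent)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value,
                     counts[value],
                     100 * counts[value] / total,
                     100 * running / total))
    return rows


for value, count, percent, cumulative in frequency_table(ages):
    print(f"{value:>4} {count:>4} {percent:6.1f} {cumulative:6.1f}")
```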

Dealing with more variables (bivariate/multivariate) Cross Tabulation

We can describe the intersection of two categorical (that is, nominal or ordinal) variables by crosstabulation. Here, I crosstabulate eye colour by gender for a group of 76 students, equally divided by gender. In this case there are five rows and two columns: by the usual rows-by-columns convention this is a five by two table. No significance is attached to which variable goes in rows or which in columns.

Crosstabulation eye colour x gender
eyecolour gender f gender m
blue 6 6
brown 12 12
green 7 7
grey 4 6
other 9 7
Total 38 38

Each cell in the table holds the count of how many students of each gender were observed to have just that eye colour. So, six of the male students had blue eyes and four of the female students had grey eyes and so on. These are the observed counts for the crosstabulation of these variables. Later we will see that these can be compared to the expected counts predicted by probabilities.
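A crosstabulation is simply a count of cases for each combination of levels. The following Python sketch shows one way to produce it, assuming the raw data arrive as one (eye colour, gender) record per student. The records are rebuilt from the observed counts given above, and the variable names are illustrative only.

```python
from collections import Counter

# Observed counts from the crosstabulation: (eye colour, gender) -> count.
cell_counts = {
    ("blue", "f"): 6, ("blue", "m"): 6,
    ("brown", "f"): 12, ("brown", "m"): 12,
    ("green", "f"): 7, ("green", "m"): 7,
    ("grey", "f"): 4, ("grey", "m"): 6,
    ("other", "f"): 9, ("other", "m"): 7,
}

# Expand to one record per student, as raw data would usually look.
students = [pair for pair, n in cell_counts.items() for _ in range(n)]

# Counting the pairs gives the cells; counting each variable alone
# gives the row and column totals.
crosstab = Counter(students)
row_totals = Counter(colour for colour, _ in students)
column_totals = Counter(gender for _, gender in students)

print(crosstab[("grey", "f")], row_totals["brown"], column_totals["m"])
```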

Central Tendency

A common summary for numerical data is the location of the central point or middle of the data. This point is taken as an indicative answer to the question "What is the most typical value for this variable?" There is more than one way to determine the centre. I explain the three most common measures of central tendency below.

Mode

The mode is the most frequently occurring value in the data. If we go through the observations and tick off once for each occurrence of a particular score we obtain the frequency count for the data. The mode is the value with the highest frequency count. There is no guarantee that there will be a single modal value and so sometimes we hear data described as bi-modal or multi-modal.

The mode is the least powerful of the measures of central tendency since it exploits so little information from the data.

The mode is the only measure of central tendency that can be computed for nominal data. It can also be computed for ordinal data.

The mode is often visualised with a histogram.

Example

Suppose that we tally the age to the nearest whole year of a class of schoolchildren and we get the following result:

Frequency Distribution of Age in Class
Age in Years Frequency of Occurrence
10 5
11 10
12 10
13 3

Here there are two modal values: 11 years and 12 years. This is a bimodal distribution of scores.
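To check the modal values mechanically, here is a short Python sketch. It assumes Python 3.8 or later, where `statistics.multimode` returns every value that shares the highest frequency count, and it rebuilds the ages from the frequency distribution above.

```python
import statistics

# Rebuild the 28 ages from the frequency distribution above.
ages = [10] * 5 + [11] * 10 + [12] * 10 + [13] * 3

# multimode returns all values tied for the highest frequency count.
modes = statistics.multimode(ages)
print(modes)
```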

Here is a histogram of the data:

A simple chart of frequency data

Median

The median is the middle score in the data set. The scores should be ranked and then, if the number of cases is odd, the median is the middle-ranking score. If the number of cases is even, then the two middle-ranking scores are summed and divided by two to calculate the median.

The median exploits more information about the data than the mode since the data are ranked, and is a more powerful expression of the central tendency.

The median can be computed for ordinal, interval and ratio data.
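The ranking rule for odd and even numbers of cases can be sketched as follows. This is a minimal illustration rather than a library-grade routine: the function name `median_of` and both small data sets are hypothetical.

```python
def median_of(scores):
    """Return the median by ranking the scores and taking the middle."""
    ranked = sorted(scores)
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of cases: the single middle-ranking score.
        return ranked[middle]
    # Even number of cases: average the two middle-ranking scores.
    return (ranked[middle - 1] + ranked[middle]) / 2


odd_case = [7, 3, 9, 1, 5]        # hypothetical scores, five cases
even_case = [7, 3, 9, 1, 5, 11]   # hypothetical scores, six cases

print(median_of(odd_case), median_of(even_case))
```

Note that the rule is applied to the ranked cases themselves, so a score that occurs several times is counted once for each occurrence.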

Example

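Consider the data in the table above. There are 28 cases, which is an even number, so we rank the ages and take the two middle cases, the 14th and the 15th, add them and divide by two. The ten-year-olds occupy ranks 1 to 5 and the eleven-year-olds occupy ranks 6 to 15, so both middle cases are aged 11. This gives 11 years as the median value. Note that the median is found from the ranked cases, not from the four distinct values (10, 11, 12, 13) that appear in the data.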

Mean

Here, I take mean to be the arithmetic mean or average - ignoring items like the geometric and harmonic means.

The mean is calculated as the sum of all scores for a variable divided by the number of cases. The mean is the most powerful indicator of central tendency, exploiting the most information from the data. The formula is often written as

${\displaystyle {\frac {\Sigma {x}}{N}}}$

The mean can be computed only for interval and ratio data.[1]

Example

Consider the data in the table above. There are four values present in the data: 10, 11, 12, 13. Weighting each age by its frequency, the sum of the ages in years for this class is (10 × 5) + (11 × 10) + (12 × 10) + (13 × 3) = 319. The total number of cases is 28, that is, N = 28. So we divide 319 by 28 to get the mean age: 11.39 years.

${\displaystyle {\frac {319}{28}}=11.39}$
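The same arithmetic can be written as a short Python sketch that works directly from the frequency distribution. The dictionary name `frequencies` is an illustrative choice.

```python
# Age in years -> frequency of occurrence, from the table above.
frequencies = {10: 5, 11: 10, 12: 10, 13: 3}

# Sum of all scores: each age counted once per pupil of that age.
total_of_ages = sum(age * count for age, count in frequencies.items())

# Number of cases, N.
n = sum(frequencies.values())

mean_age = total_of_ages / n
print(total_of_ages, n, round(mean_age, 2))
```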

Dispersion

Dispersion is the degree of spread of values in a data set. Variation is central to statistical thinking. I will introduce some of the main indicators of the dispersion of values.

Dispersion is important in descriptive statistics because two groups or two variables might have similar means, medians or modes, for example, but differ widely in dispersion. For example, it is possible that the mean incomes in Mumbai and Los Angeles are the same (I have no idea; I haven't checked) but you would not be surprised to discover that the spread of income across the populations of these two cities was very different.

Range

The range of a data set is the distance between the highest observed value and the lowest observed value of a variable.

Example

Consider the data in the table above. There are four values present in the data: 10, 11, 12, 13. The maximum value of age is 13 and the minimum value of age is 10. The range is therefore 13 - 10, which is 3.
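In code the range is a one-line calculation. A minimal Python sketch, with the ages rebuilt from the frequency table:

```python
# Rebuild the 28 ages from the frequency table.
ages = [10] * 5 + [11] * 10 + [12] * 10 + [13] * 3

# Range: highest observed value minus lowest observed value.
age_range = max(ages) - min(ages)
print(age_range)
```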

Quartiles and the Interquartile Range

Quartiles

The quartiles are three points in the data that divide the cases into equal fourths. One of the quartile points is the median - quartile two. The first quartile cuts off the lowest 25% of the data set and the third quartile cuts off the highest 25% of the data set.

Interquartile Range

The interquartile range (IQR) is defined as quartile 3 minus quartile 1. This is a robust indication of the spread of values around the median. An outlier is understood as a value so extreme as to be untypical. One common characterisation defines an outlier as a point lying more than one and a half times the IQR beyond the boundary of the IQR, that is, below quartile 1 - 1.5(IQR) or above quartile 3 + 1.5(IQR).

Example

Consider the following data consisting of 32 pupils' scores in mathematics. The table shows frequency counts and cumulative percentages.

Exam Score Count Cumulative Percent
39 1 3.125
42 1 6.250
44 1 9.375
45 1 12.500
47 1 15.625
48 1 18.750
50 3 28.125
51 1 31.250
52 3 40.625
53 2 46.875
54 1 50.000
55 2 56.250
56 3 65.625
57 1 68.750
58 2 75.000
59 1 78.125
60 2 84.375
62 2 90.625
63 1 93.750
64 2 100.00
Total 32 100

The median score is 54.5. The quartiles are

Quartile Score
First (lowest 25%) 50
Second (to median) 54.5
Third (to 75%) 58.5

So the interquartile range is 58.5 - 50 = 8.5. This is visualised by a boxplot

Boxplot showing IQR as shaded area

The interquartile range is shown as a shaded box with a line indicating the location of the median score. Also indicated are the minimum and maximum scores, shown by the 'whiskers'. On this box plot the whiskers represent the actual minimum and maximum values. Some plots instead draw the whiskers at quartile 1 - 1.5(IQR) and quartile 3 + 1.5(IQR), or at the most extreme observations inside those limits.
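The quartiles in this example can be reproduced with a short Python sketch. It assumes one particular convention: each quartile is taken as the median of the lower or upper half of the ranked scores. Other conventions exist, and library routines that interpolate differently may return slightly different quartiles for the same data. The variable names are illustrative.

```python
import statistics

# Exam score -> count, from the table above.
score_counts = {
    39: 1, 42: 1, 44: 1, 45: 1, 47: 1, 48: 1, 50: 3, 51: 1, 52: 3, 53: 2,
    54: 1, 55: 2, 56: 3, 57: 1, 58: 2, 59: 1, 60: 2, 62: 2, 63: 1, 64: 2,
}

# Expand to one ranked score per pupil (32 cases).
scores = sorted(s for s, n in score_counts.items() for _ in range(n))

# Assumed convention: quartiles 1 and 3 are the medians of the lower
# and upper halves of the ranked data.
half = len(scores) // 2
q1 = statistics.median(scores[:half])
q2 = statistics.median(scores)
q3 = statistics.median(scores[-half:])
iqr = q3 - q1

# Limits at 1.5 times the IQR beyond the quartiles, used to flag outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_limit or s > upper_limit]

print(q1, q2, q3, iqr, outliers)
```

Under this rule none of the 32 scores is flagged as an outlier, since the lowest and highest scores both fall inside the limits.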

Deviation

Deviation measures the distance between an observed score and the expected value for the variable under consideration (or perhaps the distance from some ideal value, in which case we often call the deviation the error). For a continuous variable, the expected value is the mean.

Consider the data in the table above. There are four values present in the data: 10, 11, 12, 13. The mean age is 11.39 years. To calculate the deviation of a score, for example 13, from the mean we take 11.39 from 13 to get 1.61. We notice that the deviation can be a positive or negative distance. Thus, taking a score of 11 years we calculate the deviation from the mean to be -0.39 years.
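A minimal Python sketch of these deviations follows, with the mean age computed from the frequency table. It also checks that, once each deviation is weighted by the number of pupils of that age, the deviations sum to zero (up to floating-point rounding). The names are illustrative.

```python
# Age in years -> frequency of occurrence, from the table above.
frequencies = {10: 5, 11: 10, 12: 10, 13: 3}

n = sum(frequencies.values())
mean_age = sum(age * count for age, count in frequencies.items()) / n

# Deviation of each observed age from the mean: score minus mean.
deviations = {age: age - mean_age for age in frequencies}

# Summed over all 28 cases, positive and negative deviations cancel.
total_deviation = sum(deviations[age] * count
                      for age, count in frequencies.items())

for age, deviation in deviations.items():
    print(age, round(deviation, 2))
```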

It would be useful to be able to characterise the dispersion in a data set by the average deviation from the mean, but we will see that initially it turns out to be more straightforward to deal with the average squared deviation from the mean.

Variance

Variance is the mean squared deviation of a dataset. If we remember that variance is a mean, the definition becomes very easy to understand. The formula for the population variance is

${\displaystyle {\frac {\Sigma {(x-{\bar {x}})^{2}}}{N}}}$

The top half of this formula is the sum of squares. The sum of squares of the deviations divided by the number of cases is the variance. It is the average squared distance of a score in the data set from the mean for that variable.

Example

Consider the following set of values: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. This set has a mean of 5.5. If we sum the deviations from the mean we get zero, because the positive and negative deviations cancel, and zero divided by N tells us nothing about the spread. To avoid this we square the deviations. So, tabulated we get

Deviations and Squared Deviations
Value Deviation Squared Deviation
1 -4.5 20.25
2 -3.5 12.25
3 -2.5 6.25
4 -1.5 2.25
5 -0.5 0.25
6 0.5 0.25
7 1.5 2.25
8 2.5 6.25
9 3.5 12.25
10 4.5 20.25
Sum of Squares 82.50

After the squaring operation, we arrive at a figure for the sum of the squared deviations which we can divide by N to get the variance. Since N is 10, the variance is 8.25.
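The worked example can be followed step by step in Python. This is a plain sketch of the population formula (dividing by N); the variable names are illustrative.

```python
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n = len(values)
mean = sum(values) / n

# Deviation of each value from the mean; these sum to zero.
deviations = [x - mean for x in values]

# Squaring removes the signs so the deviations no longer cancel.
squared_deviations = [d ** 2 for d in deviations]
sum_of_squares = sum(squared_deviations)

# Population variance: the mean of the squared deviations.
variance = sum_of_squares / n

print(mean, sum(deviations), sum_of_squares, variance)
```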

Understanding of variance

This measure, the variance, is a very useful summary statistic for the dispersion in our data. Moreover, variance plays a central role in statistical thinking. Many common statistical techniques involve the computation and comparison of the variances of samples, populations or variables. It suffers, however, from one drawback: if the original variable represented height in meters, the variance is expressed in meters squared. We have transformed a linear measure into a measure of area, a geometric measure. Squaring the deviations avoids a zero result, but the final figure is expressed in different units than the original. The solution lies in the derivation of the standard deviation.

Standard Deviation

The standard deviation is calculated straightforwardly as the square root of the variance. So the formula can be written

${\displaystyle {\sqrt {\frac {\Sigma {(x-{\bar {x}})^{2}}}{N}}}}$

This quantity is now in the same units as the original values, overcoming the limitation on interpreting the variance.
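As a quick check, the following Python sketch takes the square root of the population variance for the values 1 to 10 and compares it with `statistics.pstdev`, the standard library function for the population standard deviation.

```python
import math
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Population variance (divides by N), then its square root.
variance = statistics.pvariance(values)
standard_deviation = math.sqrt(variance)

# The library's population standard deviation should agree.
library_value = statistics.pstdev(values)

print(variance, round(standard_deviation, 2), round(library_value, 2))
```

The result, about 2.87, is in the same units as the original values, whereas the variance of 8.25 is in squared units.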

Informally we might say that for a normally distributed variable observations are typically within one or two standard deviations of the mean, and we will see below that we can be more precise.

Shape

Skewness

Skewness tells you to what extent the distribution of values is symmetrical around the mean. If the distribution of values is symmetrical around the mean then the skew is zero. The normal or Gaussian distribution is the best-known symmetrical distribution and looks like this:

In this illustration the mean is labelled μ, the standard deviation is σ

This distribution of values can be expressed in terms of standard deviation. Around 68% of values lie within one standard deviation of the mean. Around 95% are within two standard deviations of the mean. Much smaller fractions of the data set have values beyond two standard deviations. Further, in a normal distribution the median and the mean will be very close in value; in fact, for an ideal normal distribution mean = median = mode.
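These proportions can be checked against the ideal normal distribution. The sketch below assumes Python 3.8 or later, which provides `statistics.NormalDist`; the helper name `share_within` is my own.

```python
from statistics import NormalDist

# The standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of values within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

Because the proportions depend only on the number of standard deviations from the mean, the same figures hold for a normal distribution with any mean and standard deviation.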

Distributions may be skewed with a long tail to the left - negative skew; or a long tail to the right - positive skew.

Examples of skewness
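The text does not give a formula for skewness, so the following Python sketch rests on an assumption: it uses one common definition, the mean of the cubed standardised deviations (the population moment coefficient of skewness). The function name and the three small data sets are hypothetical and chosen only to show the sign of the result.

```python
import math


def skewness(values):
    """Mean cubed standardised deviation (population moment coefficient)."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divides by N).
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum(((x - mean) / sd) ** 3 for x in values) / n


# Hypothetical data sets.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]    # long tail to the right
left_tailed = [-x for x in right_tailed]     # mirror image: tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed), skewness(left_tailed), skewness(symmetric))
```

A long tail to the right gives a positive result, its mirror image gives the same size of result with a negative sign, and the symmetric set gives a result of zero up to rounding.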

Kurtosis[2]

Kurtosis refers to the tailedness of the data. A distribution with a high kurtosis has tails (occasional extreme values) that are more extreme (heavier) than the tails of a normal distribution. The red line D in the graph below shows such a distribution, but high kurtosis generally does not correspond to such a pointy peak. A distribution with a low kurtosis has tails that are less extreme (lighter) than the tails of the normal distribution. The blue line W in the graph is an example of such a distribution, but low kurtosis generally does not tell you anything about the peak (the beta(.5,1) distribution is an example of a low-kurtosis distribution with an infinitely pointy peak). The normal distribution (the black line, N) has a kurtosis of zero.

In a data set with high kurtosis - long tails - more of the variability in the data is due to relatively infrequent extreme deviations from the mean for that variable. In a data set with low kurtosis, more of the variability in the data is due to moderate but frequent deviations.
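As with skewness, no formula is given in the text, so this Python sketch assumes a common definition: the mean of the fourth power of the standardised deviations, minus 3 so that the normal distribution scores zero (excess kurtosis, as described in the notes). The function name and both data sets are hypothetical.

```python
def excess_kurtosis(values):
    """Fourth standardised moment minus 3 (population formula)."""
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((x - mean) ** 2 for x in values) / n  # variance
    fourth_moment = sum((x - mean) ** 4 for x in values) / n
    return fourth_moment / second_moment ** 2 - 3


# Hypothetical data: mostly small deviations plus two rare extreme values.
heavy_tailed = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]

# Hypothetical data: evenly spread values with no extreme deviations.
evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(excess_kurtosis(heavy_tailed), excess_kurtosis(evenly_spread))
```

In the first set most of the variability comes from the two extreme values, which gives a positive result; in the second the variability comes from moderate, frequent deviations, which gives a negative result.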

The following graph illustrates the kurtosis of some well known distributions. Note, however, that the tails are not easily seen in such density graphs: even when the distribution has "fat tails," the tails are still close to zero and not easily compared. Thus, it is difficult to discern kurtosis from these graphs. A better way to visualize the tails in reference to the normal distribution (i.e., kurtosis) is to use a normal quantile-quantile plot.

Notes

1. Occasionally you will see the mean given as the measure of central tendency for ordinal data and this can be justified if you are persuaded that underlying the rank scale there is a relatively isomorphous interval scale.
2. Terminology: in this section we will talk about what is technically excess kurtosis. In calculating excess kurtosis we adjust so that the kurtosis of the normal distribution is zero (rather than 3).
