Handbook of Descriptive Statistics/Introduction


A distribution is a collection of measurements of a single phenomenon. If the number of measurements is small, just listing them all may be an adequate description. In this case, no summarization or data reduction is needed. However, if the number of measurements is large, a full list may fail as a communication or analysis tool.

Happily, distributions can be summarized. Some summaries are very brief and provide just a little description. For instance, the mean is a one-number summary that captures just one aspect of a set of numbers. There are six aspects of a distribution that may warrant consideration when summarizing.

Sample size

The number of subjects or items measured is often of fundamental interest. For instance, we might be examining a distribution of the heights of college students containing measurements on 1,234 subjects.

Scale and precision

The type of data you are dealing with (continuous, categorical, ordinal, etc.) influences many of the choices you must make about how to describe and analyze your data. The units of measure (inches, kilos, %, mmol, drachmas per acre of corn, etc.) should be noted. For our example, the data were recorded in inches to the nearest 0.1 inch. In other words, measurements were rounded to the nearest tenth of an inch before recording. For categorical data, the "scale" is just the names of the categories. If we also recorded the gender of the students, we might have three categories: "Male," "Female" and "Unknown."
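As a minimal sketch of these two ideas, the Python snippet below rounds a few measurements to the nearest 0.1 inch and tallies a categorical variable. The raw heights and the list of gender labels are hypothetical values made up for illustration; only the 0.1-inch precision and the three category names come from the text.

```python
from collections import Counter

# Hypothetical raw measurements in inches (not real data).
raw_heights = [67.96, 68.04, 70.31]

# Record each height to the nearest tenth of an inch.
recorded = [round(h, 1) for h in raw_heights]

# Hypothetical categorical observations; the "scale" is the set of names.
genders = ["Male", "Female", "Unknown", "Female", "Male", "Female"]
category_counts = Counter(genders)

print(recorded)
print(category_counts)
```

For categorical data, a table of counts per category like `category_counts` is usually the natural starting summary, since means and medians are not defined for unordered names.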

Central tendency

Along the scale, about where do the data lie? Theoretically, adult human height is measured on a scale that goes on to infinity. However, most of the measurements we will observe center around a value of 68 inches (5'8"). There are many ways to describe the central tendency. For continuous data, the mean (or average) is often calculated. But the mean has limitations, and other measures of central tendency are useful: median, geometric mean, mode, etc.
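The measures named above can be computed with Python's standard `statistics` module. The sketch below uses a small, hypothetical list of heights in inches (invented for illustration, not taken from the 1,234-student example).

```python
import statistics

# Hypothetical heights in inches, recorded to the nearest 0.1 inch.
heights = [62.5, 64.0, 66.3, 68.0, 68.0, 69.1, 70.4, 72.6, 75.2]

mean = statistics.mean(heights)                      # arithmetic average
median = statistics.median(heights)                  # middle value when sorted
mode = statistics.mode(heights)                      # most frequent value
geometric_mean = statistics.geometric_mean(heights)  # nth root of the product

print(f"mean={mean:.2f} median={median} mode={mode} "
      f"geometric mean={geometric_mean:.2f}")
```

Comparing the mean with the median is a quick informal check: when the two differ noticeably, a few extreme values may be pulling the mean toward one end of the scale.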

Spread

Central tendency tells you where the data tend to be, but not all of the data will have the same value. Usually, some will be higher and some will be lower. The college students have heights that lie between about 4 feet (48 inches) and about 7 feet (84 inches). The spread in the data can be summarized in many ways: range, variance, standard deviation, inter-quartile range, etc.
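A short sketch of these spread measures, again using Python's `statistics` module and a hypothetical list of heights. Note two assumptions: `variance` and `stdev` here are the sample versions (dividing by n - 1), and `quantiles` uses the module's default "exclusive" method; other software may compute quartiles slightly differently.

```python
import statistics

# Hypothetical heights in inches (not real data).
heights = [62.5, 64.0, 66.3, 68.0, 68.0, 69.1, 70.4, 72.6, 75.2]

data_range = max(heights) - min(heights)   # largest minus smallest
variance = statistics.variance(heights)    # sample variance (n - 1 divisor)
std_dev = statistics.stdev(heights)        # square root of the variance

# Quartiles split the sorted data into four equal-sized groups.
q1, q2, q3 = statistics.quantiles(heights, n=4)
iqr = q3 - q1                              # inter-quartile range

print(f"range={data_range:.1f} variance={variance:.2f} "
      f"sd={std_dev:.2f} IQR={iqr:.2f}")
```

The range depends only on the two most extreme values, while the inter-quartile range ignores the top and bottom quarters of the data, so the two can tell quite different stories when outliers are present.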

Shape

Shape is the richest aspect of the distribution and often the most difficult to summarize. For many measurements, the bulk of the data occur at the middle and the number of observations tails off at higher and lower values. The classic normal distribution is one such distribution: it has a "bell shape." If all the values in the range of data occur about as often as each other, then the shape is said to be "uniform" or "rectangular." If all the data tend to be at one end of the range, with a declining number of cases observed at the other, the distribution is said to be "skewed."

There are numerical methods for describing the shape of a distribution. For instance, the degree of skew can be calculated and reported. However, often the best way to describe the shape is to draw a picture of the data such as a histogram.
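Both approaches can be sketched in a few lines of Python. The skew calculation below uses one common definition, the moment-based coefficient g1 = m3 / m2^1.5; other definitions with small-sample corrections exist, so treat this choice as an assumption. The histogram is a plain-text stand-in for a drawn chart, and the helper names, the data, and the 4-inch bin width are all hypothetical.

```python
import math
from collections import Counter

def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (divides by n, no correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

def text_histogram(values, bin_width):
    """Return one text line per bin, with one '*' per observation."""
    # Assign each value to the bin whose lower edge is a multiple of bin_width.
    counts = Counter(math.floor(x / bin_width) for x in values)
    lines = []
    for b in range(min(counts), max(counts) + 1):
        low = b * bin_width
        lines.append(f"{low:5.1f}-{low + bin_width:5.1f} | {'*' * counts[b]}")
    return lines

# Hypothetical heights in inches (not real data).
heights = [62.5, 64.0, 66.3, 68.0, 68.0, 69.1, 70.4, 72.6, 75.2]

print(f"skewness = {skewness(heights):.3f}")
for line in text_histogram(heights, bin_width=4.0):
    print(line)
```

A skewness near zero suggests a roughly symmetric shape, a positive value a longer tail toward high values, and a negative value a longer tail toward low values. The picture usually conveys more than the single number.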

Outliers

Often, a small number of observations have values substantially above or below the bulk of the data. These outliers are sometimes disturbing or confusing, but they can be very important. The best way to summarize outliers is often just to note them individually or highlight them in the graphical description.
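To note outliers individually, they first have to be identified. The text above does not prescribe a rule, so the sketch below assumes one widely used convention: flag any value more than 1.5 inter-quartile ranges below the first quartile or above the third. The function name `flag_outliers`, the multiplier `k`, and the data are hypothetical choices for illustration.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    The 1.5 * IQR rule is an assumed convention, not the only option.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

# Hypothetical heights in inches, including one unusually short subject.
heights = [48.0, 62.5, 64.0, 66.3, 68.0, 68.0, 69.1, 70.4, 72.6, 75.2]

outliers = flag_outliers(heights)
print("Observations to note individually:", outliers)
```

A flagged value is a prompt for a closer look, not an automatic deletion: it may be a recording error, or it may be the most informative observation in the data set.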