Statistics Ground Zero/Statistical Measure

Statistical Measure

Statistical data are assigned to one of four levels of measurement: nominal, ordinal, interval and ratio. The level of measurement determines which mathematical operations can meaningfully be applied to the data. Below I briefly outline the levels and give diagnostics for each, along with an indication of which common procedures are possible.

You need to determine the level of measurement of your data before you can decide how they can be analysed. Once you know which level your data belong to, the number of applicable hypothesis testing routines is greatly reduced, so the decision about what to do with your data becomes simpler. The first step towards selecting the appropriate statistical test is therefore to determine the correct level of measurement for the test variable.

I emphasise here the indicator of central tendency because it is a frequent target of testing. When someone wishes to 'compare two groups' they frequently mean that they wish to compare some typical measure of one group against that of the other, and a common typical measure is central tendency, for example the average score. What is meant by the average differs for each level of measurement.

Nominal

We call data nominal if the numbers that record observations really stand for names. As an example, take the answer to the question 'Do you smoke?'. A response to this might be recorded as 0, meaning no, or 1, meaning yes. The data are nominal, but coded by numbers. As a further example, consider eye colour. We could decide to categorise the eye colour of each person in our sample according to the following scheme:

Eye Colour    Numeric code
brown         1
blue          2
green         3
grey          4
other         5

The pairing of category and number is arbitrary. The numbers here stand in for the names of the colours.

Common analytical techniques

Because these data are nominal, there is only one permissible mathematical procedure: counting. We can count the instances of each number and record the totals. These totals are called frequencies. The most common practice in analysing more than one nominal variable is cross tabulation, which investigates associations between variables: for example, we could determine whether eye colour is associated with or independent of gender. For a two-way table, the hypothesis of independence can be tested with Pearson's chi-square test; for tables larger than two by two, the likelihood ratio chi-square is also commonly used.
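
As a minimal sketch of counting and cross tabulation, the following Python uses only the standard library. The sample observations, the variable names and the helper pearson_chi_square are invented for illustration; the function computes only the test statistic and the degrees of freedom, not a p-value, and the sample is far too small for a real test.

```python
from collections import Counter

# Hypothetical sample: (eye colour code, gender) pairs, invented for illustration.
observations = [
    (1, "F"), (1, "M"), (2, "F"), (2, "F"), (1, "M"),
    (3, "M"), (2, "M"), (1, "F"), (3, "F"), (1, "M"),
]

# Frequencies: how often each eye colour code occurs.
eye_counts = Counter(code for code, _ in observations)

# Cross tabulation: how often each (eye colour, gender) cell occurs.
cells = Counter(observations)
row_totals = Counter(code for code, _ in observations)
col_totals = Counter(gender for _, gender in observations)
n = len(observations)


def pearson_chi_square(cells, row_totals, col_totals, n):
    """Sum of (observed - expected)^2 / expected over every cell of the table."""
    stat = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n
            observed = cells.get((row, col), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


chi2 = pearson_chi_square(cells, row_totals, col_totals, n)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)
print(eye_counts, chi2, dof)
```

The statistic would then be compared against a chi-square distribution with dof degrees of freedom, a step that is left out here.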

Indication of the typical

For nominal data, the measure of the centre is the mode: the category that occurs most frequently.
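
A short sketch of finding the mode, assuming the eye colour coding used earlier on this page. The sample list is invented for illustration. statistics.multimode is used rather than statistics.mode so that ties between categories are reported instead of hidden.

```python
from statistics import multimode

# Coding scheme for eye colour, as in the table above.
labels = {1: "brown", 2: "blue", 3: "green", 4: "grey", 5: "other"}

# Hypothetical coded sample, invented for illustration.
eye_codes = [1, 2, 1, 3, 1, 2, 5, 4, 2, 1]

# multimode returns every code that shares the highest frequency.
modes = multimode(eye_codes)
typical = [labels[code] for code in modes]
print(typical)
```

Note that taking the arithmetic mean of the codes would produce a number, but that number would have no meaning, because the pairing of category and code is arbitrary.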

Ordinal

If we can rank data items then we have ordinal data. So, if in a race we assign the winner the number 1, the first runner up the number 2 and the second runner up the number 3, then these numbers represent ordinal data. We can count and total to get the frequency of each number, as with nominal data, but we can also meaningfully sort the results. There is no regularity to the intervals of ordinal data. Consider a Likert item:

Response                      Numeric code
Strongly agree                1
Agree                         2
Neither agree nor disagree    3
Disagree                      4
Strongly disagree             5

We can order the responses as suggested by the numeric codes but we do not assume that the distance from 1 to 2 is the same as from 4 to 5.
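
Because ordinal data can be sorted, a commonly used indication of the typical is the median, which depends only on the ordering and not on the distances between codes. The sketch below assumes the Likert coding shown above; the responses are invented for illustration. statistics.median_low is used so that the result is always one of the observed categories.

```python
from statistics import median_low

# Likert coding, as in the table above.
labels = {
    1: "Strongly agree",
    2: "Agree",
    3: "Neither agree nor disagree",
    4: "Disagree",
    5: "Strongly disagree",
}

# Hypothetical responses to one Likert item, invented for illustration.
responses = [1, 2, 2, 3, 4, 5]

# With an even number of responses the ordinary median would average the two
# middle codes (2 and 3 here), giving 2.5, which is not a valid category.
# median_low returns the lower of the two middle values instead.
typical_code = median_low(responses)
typical_label = labels[typical_code]
print(typical_code, typical_label)
```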

Common analytical techniques

Ordinal data can be cross tabulated and tested for strength of association with one of the non-parametric measures of correlation, such as Spearman's rho or Kendall's tau. The chi-square test can also be applied to the cross tabulation to determine whether there is an association between ordinal variables, but this ignores the ranking information in the variables and treats them as purely categorical.
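
As an illustration of a rank-based measure of association, here is a minimal sketch of Spearman's rho computed as the Pearson correlation of the ranks, with tied values given the average of the ranks they span. The helper names average_ranks, pearson and spearman_rho and the two Likert items are invented for illustration; this is one straightforward way to compute the coefficient, not a tuned implementation.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical answers from seven respondents to two Likert items.
item_a = [1, 2, 2, 3, 4, 5, 5]
item_b = [2, 1, 3, 3, 4, 4, 5]
print(spearman_rho(item_a, item_b))
```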

If we have a categorical variable that groups our subjects or cases, then we can compare the rankings of different groups. With two groups we can use the Mann-Whitney U test and with more than two the Kruskal-Wallis test. We can similarly test one group on two variables using the Wilcoxon signed-rank test.
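
To show what comparing the rankings of two groups amounts to, the sketch below computes the Mann-Whitney U statistic by direct pair counting. The function name mann_whitney_u and the sample ratings are invented for illustration, and only the statistic is computed; obtaining a p-value requires tables or a statistics package.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent samples.

    U_a counts the pairs (a, b) in which a exceeds b, with ties counted
    as one half. U_a + U_b always equals len(group_a) * len(group_b).
    """
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


# Hypothetical Likert ratings from two groups, invented for illustration.
group_1 = [1, 2, 2, 3, 3]
group_2 = [3, 4, 4, 5, 5]
u_1, u_2 = mann_whitney_u(group_1, group_2)
print(u_1, u_2)
```

A value of U near zero for one group means that almost every member of that group ranks below almost every member of the other.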

Scalar Variables

Because ratio and interval data are treated alike for many statistical purposes (with a few uncommon exceptions), both are often called scalar. The most important exception is the coefficient of variation, which should only be computed for ratio data. Sometimes the term continuous is used to include both interval and ratio data; strictly, a continuous variable is one for which any value is possible within the limits of its range. To add to the terminological confusion, the term numeric variable is sometimes used to mean a variable that takes a numeric value, excluding ordinal data; this is equivalent to the use of scalar variable.

Interval

Interval data lie on a numbered line where the distance between points is meaningful and regular: if there are ten points difference between 20 and 30, then there is the same distance between 40 and 50. The zero point on an interval scale is arbitrary. A straightforward example is the Celsius temperature scale. On this scale, zero is arbitrarily defined to be the freezing point of water and 100 its boiling point. Intervals between these are determined by calibration (e.g. drawing equally spaced marks on a column of mercury). You can have a reading below zero.
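
One way to see what an arbitrary zero implies is to re-express the same temperatures on another interval scale. The sketch below, with invented variable names, converts Celsius readings to Fahrenheit: equal intervals stay equal, but the ratio between two readings changes, which is why ratios of interval data carry no meaning.

```python
def c_to_f(celsius):
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9 / 5 + 32


# Equal intervals in Celsius remain equal in Fahrenheit.
gap_low = c_to_f(30) - c_to_f(20)
gap_high = c_to_f(50) - c_to_f(40)

# Ratios are not preserved: 20 is "twice" 10 in Celsius only.
ratio_c = 20 / 10
ratio_f = c_to_f(20) / c_to_f(10)
print(gap_low, gap_high, ratio_c, ratio_f)
```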

Ratio

In the case of ratio data, the measurement scale has regular intervals, there is a true zero point, and values on the scale can meaningfully be expressed as ratios of one another. Consider height in meters: the distance between ten meters and twenty meters is the same as between forty meters and fifty; zero means no height; and someone who is two meters tall is twice the height of someone who is one meter tall.
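
The practical effect of a true zero can be sketched with the coefficient of variation (standard deviation divided by mean). For ratio data it is unchanged by a change of unit, whereas for interval data it depends on where the arbitrary zero sits. The helper name coefficient_of_variation and all sample values are invented for illustration.

```python
import statistics as st


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return st.stdev(values) / st.mean(values)


# Ratio data: heights in meters and the same heights in centimeters.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
heights_cm = [h * 100 for h in heights_m]
cv_m = coefficient_of_variation(heights_m)
cv_cm = coefficient_of_variation(heights_cm)

# Interval data: the same temperatures in Celsius and in kelvin.
# Only the position of zero differs, yet the result changes.
temps_c = [10.0, 15.0, 20.0]
temps_k = [t + 273.15 for t in temps_c]
cv_c = coefficient_of_variation(temps_c)
cv_k = coefficient_of_variation(temps_k)
print(cv_m, cv_cm, cv_c, cv_k)
```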

Common analytical techniques

The common descriptive statistics are all computable for scalar variables, for example measures of central tendency (mean, median and mode), of dispersion (variance, standard deviation, range, quartiles) and of shape (skewness and kurtosis). Associations between scalar variables can be established by correlation (Pearson's r) and, if the variables meet certain conditions, further investigated by regression analysis. Commonly used hypothesis tests such as Student's t-test (for two scalar variables or two groups of cases) or the ANOVA (for more than two groups of cases or variables) are used to test whether means differ. Regression analysis is often used to model the relationship between two or more scalar variables by a mathematical equation.
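
A brief sketch of some of these computations using Python's standard statistics module. The two score lists and the helper student_t are invented for illustration; the helper computes the pooled-variance two-sample t statistic only, under the assumption of equal variances, and does not produce a p-value. Skewness and kurtosis are omitted because the standard library does not provide them.

```python
import statistics as st

# Hypothetical scores for two groups of cases, invented for illustration.
scores_a = [12.0, 15.0, 11.0, 14.0, 13.0]
scores_b = [16.0, 18.0, 15.0, 17.0, 19.0]

# Central tendency and dispersion for one group.
summary = {
    "mean": st.mean(scores_a),
    "median": st.median(scores_a),
    "variance": st.variance(scores_a),  # sample variance, n - 1 denominator
    "stdev": st.stdev(scores_a),
    "range": max(scores_a) - min(scores_a),
    "quartiles": st.quantiles(scores_a, n=4),
}


def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance."""
    n_a, n_b = len(a), len(b)
    pooled = ((n_a - 1) * st.variance(a) + (n_b - 1) * st.variance(b)) / (n_a + n_b - 2)
    return (st.mean(a) - st.mean(b)) / (pooled * (1 / n_a + 1 / n_b)) ** 0.5


t_stat = student_t(scores_a, scores_b)
print(summary, t_stat)
```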
