Social Statistics/Chapter 4

Means and Standard Deviations

North Americans, Europeans, Japanese, Australians, Koreans, New Zealanders, and people in a few other countries are very lucky. However difficult life may be for individual people, our countries are very rich. If we in the rich world have problems like poverty, homelessness, and malnutrition, it's because we choose to have them. We could always just decide to spend the money to make sure that everyone could live a decent life. We may choose not to spend the money, but at least we have the choice. National income in the rich countries of the world is typically $30,000 - $45,000 per person per year, and all of the rich countries of the world have democratic governments. It's up to us to decide how we want to spend our resources.

In many of the poorer countries of the world, the resources simply do not exist to make sure that everyone has a decent standard of living. What's worse, many of these countries aren't democracies, so even where resources do exist people don't necessarily have the power to choose to share them out equally. As a result, over one-third of the world's children under age 5 are stunted (shorter than they should be) due to malnutrition. Over 20% of the world's population can't afford to eat on a regular basis. About 40% of the people of the world -- roughly 2.5 billion people -- when they have to go to the bathroom literally shit on the ground. Another 30% use outhouses. Only about 30% of the world's people have toilets with running water. It's hard to wash after you wipe if there's no running water in your bathroom.

Most rich countries make some effort to help make conditions better for the world's poor. Some basic data on rich countries' overseas development assistance (ODA) budgets are presented in Figure 4-1. Overseas development assistance is the amount a country spends on aid to help people in poorer countries.
This database draws together data from the World Bank and the Organization for Economic Cooperation and Development (OECD). The cases in the database are 20 of the richest countries in the world, including the United States. Included are two metadata items, the countries' names and three-digit country codes. Five variables are also included:
AID/GNP -- a country's ODA spending in relation to its total national income
ADMIN/AID -- the proportion of a country's aid that is spent on administrative costs
MIL/GNP -- a country's military spending in relation to its national income
GDP_2008 -- a country's level of national income per capita
EUROPEAN -- an indicator that a country is European (1) versus non-European (0)

Figure 4-1. Database of overseas development assistance (ODA) and related figures for 20 rich countries from OECD and World Bank sources, 2008

The twenty rich countries included in Figure 4-1 are ranked by the generosity of their ODA spending in Figure 4-2. Compared to other rich countries, the United States comes in dead last. The United States spends less on aid (as a proportion of its total income) than any other country, 0.19%. Other countries are more generous, but not much more generous. Australia and Canada give 34 cents of every $100. France and Germany give 40 cents. Norway, Luxembourg, and Sweden are the most generous, giving about 1% of their total national incomes to help others. To match the most generous countries in the world, the United States would have to quintuple its annual ODA spending.

Figure 4-2. Aid generosity rankings for 20 rich countries, 2008

An interesting pattern in ODA spending that is made clear by Figure 4-2 is that all of the most generous countries are European. We might generalize from this observation to hypothesize that European country status is an important determinant of ODA spending levels. The results of a regression of ODA spending on European country status are reported in Figure 4-3. The intercept is 0.27, which is the expected value of ODA spending when European status = 0. In other words, rich non-European countries tend to give about 0.27% of their national incomes. The regression coefficients in Figure 4-3 can also be used to calculate the expected value of ODA spending for European countries. For European countries, European status = 1, so ODA spending = 0.27 + 0.33 x 1 = 0.60% of national income. This expected value could be used to predict ODA spending in a rich European country that was not included in the database, like Liechtenstein. Based on the regression reported in Figure 4-3 the predicted value of ODA spending for Liechtenstein would be 0.60% of national income. Since Liechtenstein is a European country (European status = 1), this prediction would be an interpolation, not an extrapolation.

Figure 4-3. Regression of ODA spending on European country status, 2008

A scatter plot of the relationship between European country status and ODA spending is depicted in Figure 4-4. A regression line has been plotted on the graph. The expected values of ODA spending for both non-European and European countries have also been noted. As with any regression model, the regression line graphed in Figure 4-4 passes through the middle of the scatter of the data. The only thing that is different from the scatter plots in Chapter 1 and Chapter 2 is that the independent variable in Figure 4-4 takes only two values. As a result, all the points line up over either European status = 0 or European status = 1. This has no effect on the meaning of the regression line or how it is calculated. The line still represents the most likely value of the dependent variable (ODA spending) for any given level of the independent variable (European country status). Similarly, deviations from the regression line still represent error.

Figure 4-4. ODA spending versus European country status, 2008

This chapter explains how expected values and errors can be used to describe and compare variables. First, even one variable alone can have an expected value, without any need for a linear regression model (Section 4.1). A new model, the mean model, is introduced to define the expected value of a variable when there are no other variables involved in the analysis. Second, any expected value is associated with error, since in most cases the values of variables don't equal their expected values (Section 4.2). In both mean models and regression models the errors balance each other out and average out to zero. Third, in both mean models and regression models the amount of error can be measured using a standard deviation (Section 4.3). Most of the data used in a statistical model fall within one standard deviation of their expected values. An optional section (Section 4.4) demonstrates how standard deviations are actually calculated by statistical computer programs. Finally, this chapter ends with an applied case study of income and employment levels across the 33 political divisions of China (Section 4.5). This case study illustrates how means can be used to compare variables. It also shows how regression standard deviations are related to the standard deviations of variables. All of this chapter's key concepts are used in this case study. By the end of this chapter, you should have gained a much deeper understanding of the role played by error in statistical models.

4.1. The mean model

As Figure 4-4 demonstrates, a regression model can be used to calculate the expected value of Overseas Development Assistance (ODA) either for non-European countries or for European countries. The expected value of a dependent variable for a specific group of cases (like non-European or European countries) is known as a conditional mean. Conditional means are the expected values of dependent variables for specific groups of cases. Another example of the use of conditional means is illustrated in Figure 4-5. Figure 4-5 depicts a scatter plot and regression of wage income on gender using data for employed Americans aged 20-29 from the 2008 US Survey of Income and Program Participation (SIPP), Wave 2. The SIPP database includes 4964 cases (2208 women and 2756 men). Since these would be too many to plot on a scatter plot, 100 random cases (46 women and 54 men) have been graphed in Figure 4-5 to illustrate what the data look like.

Figure 4-5. Wage income versus gender for a random sample of 100 employed SIPP subjects ages 20-29 (2008)

The coefficients of the regression of income on gender are reported in Figure 4-6. In this regression model, the independent variable is gender (coded as "maleness": 0 for women and 1 for men) and the dependent variable is wage income (defined as income earned through working a job and calculated as twelve times the monthly income recorded in the SIPP). The regression model has an intercept of 33876 and a slope of 4966. In other words, the equation for the regression line is Income = 33876 + 4966 x Male. For women (Male = 0), the expected value of wage income is 33876 + 4966 x 0 = 33876 + 0 = $33,876. For men (Male = 1), the expected value of wage income is 33876 + 4966 x 1 = 33876 + 4966 = $38,842. In other words, the conditional mean income for women is $33,876 while the conditional mean income for men is $38,842.
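The link between a 0/1 independent variable and conditional means can be checked with a short Python sketch. The incomes below are made-up illustrative values, not SIPP data, and the helper name `fit_line` is hypothetical. The sketch only illustrates that a least-squares line fitted to a 0/1 variable has an intercept equal to the mean of the 0 group and a slope equal to the difference between the two group means.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for a single independent variable."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical incomes (NOT the SIPP data); 0 = woman, 1 = man
male = [0, 0, 0, 1, 1, 1, 1]
income = [30000, 34000, 38000, 33000, 37000, 41000, 45000]

intercept, slope = fit_line(male, income)

# Conditional means computed directly, by group
mean_women = mean(y for m, y in zip(male, income) if m == 0)
mean_men = mean(y for m, y in zip(male, income) if m == 1)

print(intercept, mean_women)        # intercept matches the mean for women
print(intercept + slope, mean_men)  # intercept + slope matches the mean for men
```

With real data the same relationship holds, which is why the regression in Figure 4-6 can be read as a table of two conditional means.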

Figure 4-6. Table of regression results for the regression of wage income on maleness (from Figure 4-5 but using data from all 4964 cases)

If it's possible to calculate conditional mean incomes based on people's genders, it should be possible to calculate the mean income for people in general. Means are the expected values of variables. What would happen if we put all 4964 people in the SIPP database together into one big group and calculated the expected value of their income? The result would look something like Figure 4-7, which takes the 46 women and 54 men from Figure 4-5 and groups them into a single category called "people."

Figure 4-7. Wage income for a random sample of 100 employed SIPP subjects ages 20-29 (2008)

The mean income of all 4964 employed Americans age 20-29 is $36,633. The mean income can be calculated by adding up the incomes of all 4964 people and dividing by 4964. This is what most people would call the "average" value of a variable. Social scientists usually use the term "mean" instead of the term "average" because "average" can also mean "typical" or "ordinary." The term "mean" always means just one thing: it is the expected value of a variable, calculated by summing up the values of all the individual cases of a variable and dividing by the number of cases. The mean is more than just a mathematical calculation. Like the mean income of $36,633 for all twentysomethings, the mean incomes for women ($33,876) and for men ($38,842) could have been calculated by summing up all the incomes of the women or men in the database and dividing by the number of women or men. The conditional means of income for women and men from the regression model in Figure 4-6 are identical to the individual means of income for women and men. The difference is that calculating the conditional means using a regression model provided both an equation and a statistical model for thinking of the conditional means as predicted values. Based on the regression model for income (Figure 4-6), any employed twentysomething American woman would be predicted to have an income of $33,876. Any employed twentysomething American man would be predicted to have an income of $38,842. What would be the predicted income of an employed twentysomething American in general, if the SIPP database had included no information on gender? Obviously, the answer would be $36,633, the mean income for all 4964 people in the database. The statistical model behind this prediction is a mean model. Mean models are very simple statistical models in which a variable has just one expected value, its mean. The mean model can be thought of as a linear regression model with no independent variable. 
If you squeeze all the data from Figure 4-5 into a single group like in Figure 4-7, you turn a linear regression model into a mean model. The big difference between using a mean model as a statistical model and just calculating a mean by adding up all the values and dividing by the number of cases is how you think about it. In the mean model, the mean is an expected value, not just a bunch of arithmetic. Each time an individual case deviates from the mean, that deviation is a form of error. In a linear regression model, regression error is the degree to which an expected value of a dependent variable differs from its actual value. In the mean model, error is the degree to which the mean of a variable differs from its actual value. In the mean model, if a person earns $30,000 per year, that income can be divided into two parts: the mean income ($36,633) and error (-$6,633). If another person earns $40,000 a year, that income can be divided into two parts: the mean income ($36,633) and error (+$3,367). In the mean model, your income isn't just your income. Your income is composed of the mean income for a person like you, plus or minus some error.
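The arithmetic behind the mean model can be sketched in a few lines of Python using only figures reported in the text. The function name `error_from_mean` is a hypothetical helper. Because the two conditional means are rounded to whole dollars, the weighted average reproduces the overall mean only up to rounding.

```python
# Figures reported in the text (SIPP 2008, employed Americans aged 20-29)
n_women, n_men = 2208, 2756
mean_women, mean_men = 33876, 38842

# The overall mean is the average of the two conditional means,
# weighted by the number of cases in each group.
overall_mean = (n_women * mean_women + n_men * mean_men) / (n_women + n_men)

def error_from_mean(actual, expected):
    """Error in a mean model: the actual value minus the expected value (the mean)."""
    return actual - expected

reported_mean = 36633
low_error = error_from_mean(30000, reported_mean)   # below the mean, so negative
high_error = error_from_mean(40000, reported_mean)  # above the mean, so positive

print(round(overall_mean))    # 36633
print(low_error, high_error)  # -6633 3367
```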

4.2. Models, parameters, and degrees of freedom

Smoking causes more preventable disability and death worldwide than any other human activity. It is an incredibly important challenge to the world's health. In Canada, about 17.9% of the adult population identify themselves as smokers (Health Canada data for 2008). Smoking rates, heavy drinking rates, and temperatures across the 13 Canadian provinces and territories are summarized in the database in Figure 4-8. The mean rate of smoking across these 13 political divisions is 20.3%. This differs from the overall national average because several low-population provinces and territories have high smoking rates. A mean model for smoking rates in Canadian provinces and territories would suggest that smoking rates equal an expected value of 20.3% plus or minus some error in each case.

Figure 4-8. Smoking data for 13 Canadian provinces and territories, 2008

The mean model is a very simple approach to understanding smoking rates. It says something about smoking rates -- that they're not 0% or 50% -- but doesn't say anything about why smoking rates differ from province to province. All the variability in smoking rates across provinces is considered to be error in the model. A regression model might help explain some of the differences in smoking rates across Canada's 13 provinces and territories. One theory of the differences in smoking rates might be that smoking rates depend on the weather. Canada is cold. The mean annual temperature across the capital cities of Canada's 13 provinces and territories is less than 38 degrees Fahrenheit. This is much colder than New York (57 degrees), Chicago (51 degrees), or Los Angeles (66 degrees). Even Minneapolis (average annual temperature 45 degrees) and Fargo (41 degrees) are warmer than most of Canada. One theory might be that some people smoke because they get bored when they can't go out in the cold weather. A specific hypothesis based on this theory would be that smoking rates rise as the average temperature falls. The results of a regression model using average annual temperature as the independent variable and the smoking rate as the dependent variable are presented in Figure 4-9.

Figure 4-9. Regression of smoking rates on average temperatures across the 13 Canadian provinces and territories, 2008

The intercept of 37.00 means that a province with an annual average temperature of 0 degrees would have an expected smoking rate of 37.0%. Since none of Canada's provinces is this cold, the intercept is an extrapolation. Starting at the intercept of 37.0%, the expected value of the smoking rate declines by 0.44% for every 1 degree increase in temperature. As predicted by the boredom theory of smoking, smoking rates fall as the temperature rises. Which model is better for understanding smoking rates, the mean model or the linear regression model? Both provide expected values. The relationship between the mean model and the regression model for smoking is graphed in Figure 4-10. The left side of Figure 4-10 depicts the mean model for smoking, lining up all the provinces just like the SIPP respondents in Figure 4-7. The right side of Figure 4-10 depicts the regression model for smoking, spreading the provinces out according to their temperatures. Arrows show how the data points in the mean model correspond to the data points in the regression model for four illustrative provinces. In the case of smoking in Canadian provinces, the regression model seems to explain more about smoking than the mean model. Given the availability of temperature data, the regression model seems more useful than the mean model.
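As a rough sketch, the regression coefficients quoted above can be turned into a small Python function for computing expected smoking rates. The function and constant names are hypothetical. The temperature of 38 degrees is used only because the text describes the mean temperature as a little under 38 degrees, so the result should land near the mean smoking rate of 20.3%.

```python
INTERCEPT = 37.00  # expected smoking rate (%) at 0 degrees Fahrenheit
SLOPE = -0.44      # change in the smoking rate (percentage points) per 1 degree

def expected_smoking_rate(temperature_f):
    """Expected smoking rate (%) for a given average annual temperature."""
    return INTERCEPT + SLOPE * temperature_f

rate_at_zero = expected_smoking_rate(0)  # the intercept itself, an extrapolation
rate_at_38 = expected_smoking_rate(38)   # close to the mean smoking rate

print(round(rate_at_zero, 2), round(rate_at_38, 2))
```

The near match at 38 degrees reflects a general property of least-squares regression lines: they pass through the point defined by the means of the two variables.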

Figure 4-10. Illustration of mean and regression models of smoking rates across the 13 Canadian provinces and territories, 2008

The mean model in Figure 4-10 gives an expected value for the overall level of smoking using just one figure (the mean) while the regression model gives different expected values of smoking for each province using two figures (the intercept and the slope). These figures are called parameters. Parameters are the figures associated with statistical models, like means and regression coefficients. Calculating parameters like means and regression coefficients requires data. In the Canadian province data (Figure 4-8) there's plenty of data to calculate both the mean and the regression coefficients. Usually it's not a problem to have enough data to calculate the parameters of a model, but when there are very few data points there can be problems. What if you had a database with just one case? For example, you might want to study the population of the world in 2010. The population of the world is around 6.7 billion people. Can you model the population of the world using a mean model? Yes, the population of the world in 2010 has a mean of 6.7 billion people. There's no error in this mean model, because there's only one case -- the world -- and its actual population is equal to the mean. With a database of 1 case, it is possible to calculate the 1 parameter of the mean model, the mean. Could you study the population of the world in 2010 using a linear regression model? You might hypothesize that population is related to rainfall. If the world were all one big dry desert, it would be expected to have a small population. If the world were all a lush green paradise, it would be expected to have a large population. This is a good idea, but the problem is that there is only one world to study. It is impossible to calculate the effect of rainfall on the population of the world when there is only one world to study.
Regression models require the calculation of two parameters, and it turns out that you have to have a database of at least two cases in order to calculate both a slope and an intercept. What if you had a database with two cases? For example, you might want to model Korean populations. There are two Korean countries, North Korea and South Korea. North Korea has a population of 24 million people, while South Korea has a population of 48 million. Using the mean model, the expected value of the population of a Korean country is the mean of these two cases, or 36 million people. Both North Korea and South Korea have an error of 12 million (North Korea has 12 million fewer people than the mean, while South Korea has 12 million more than the mean). Even though it seems like both cases have independent errors, in fact there is only one level of error in the model. If North Korea is 12 million below the mean, South Korea has to be 12 million above the mean to balance it out. There are two errors, but only one of them is free to vary. This quirky mathematical fact means that in the mean model, not every case is free to vary at random. If a variable has 2 cases, and you know the mean of the variable, then only 1 case can vary freely. The other case has to balance out the first case. If there are three cases, then only two can vary freely. More generally, if there are N cases, and you know the mean, only N-1 cases are free to vary. This number, N-1, is known as the degrees of freedom of a mean model. Degrees of freedom are the number of errors in a model that are actually free to vary. The degrees of freedom of a mean model is N-1 because the mean model has only one parameter, the mean. On the other hand, the degrees of freedom of a regression model is N-2, because the regression model has two parameters (the slope and intercept). That means that there have to be at least two cases in a database in order to use a regression model.
Since most databases have dozens or hundreds of cases, this usually isn't a problem. The main use of degrees of freedom is in making statistical calculations about error. The total amount of error in a statistical model depends on the total number of degrees of freedom, not on the total number of cases. Statistical computer programs use degrees of freedom in calculating many of the figures associated with statistical models, and usually report the degrees of freedom of the model as part of their output of model results. The basic idea, though, is just that any statistical model uses up one degree of freedom for every parameter it calculates. A mean model with 1 parameter based on N cases has N-1 degrees of freedom. A linear regression model with 2 parameters has N-2 degrees of freedom. No model can have negative degrees of freedom, so it takes at least 1 case to use a mean model and 2 cases to use a regression model.
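The Korean example can be worked through in a minimal Python sketch. The function names are hypothetical, and the populations are the figures given in the text (in millions). The sketch shows that the errors in a mean model sum to zero, which is why only N-1 of them are free to vary, and that each parameter uses up one degree of freedom.

```python
def mean_model_errors(values):
    """Deviation of each case from the mean of all cases."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def degrees_of_freedom(n_cases, n_parameters):
    """One degree of freedom is used up for every parameter calculated."""
    return n_cases - n_parameters

korea = [24, 48]  # North Korea and South Korea, millions of people

errors = mean_model_errors(korea)
df_mean_model = degrees_of_freedom(len(korea), 1)  # mean model: 1 parameter
df_regression = degrees_of_freedom(len(korea), 2)  # regression: 2 parameters

print(errors)         # [-12.0, 12.0]
print(sum(errors))    # 0.0, the errors balance each other out
print(df_mean_model)  # 1
print(df_regression)  # 0
```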

4.3. Standard deviation and regression error

All statistical models that use parameters to produce expected values (like the mean model and the linear regression model) produce model error. All this means is that statistical models usually don't describe the world perfectly. All statistical models are simplifications of the real world, so they all have error. The error in a mean model is usually just called error or deviation from the mean, while the error in a regression model is usually called regression error. In the mean model, the model explains none of the variability in the variable. The mean model has only one parameter, the mean, and all of the variability in the variable becomes error in the mean model. As a result, the spread of values of the error is just as wide as the spread of values of the variable itself. This spread can be measured and expressed as a number. The most commonly used measure of the spread of a variable is the standard deviation. Standard deviation is a measure of the amount of spread in a variable, which is the same thing as the amount of spread in the error in a mean model. The standard deviation of a variable (or the standard deviation of the error in a mean model) depends on two things: the amount of error in the mean model and the number of degrees of freedom in the mean model. For the smoking rates of the 13 Canadian provinces and territories, the standard deviation is 5.3%. In the linear regression model, some portion of the variability in the dependent variable is accounted for by variation in the independent variable. This is illustrated in Figure 4-10, where the smoking rates of the 13 Canadian provinces and territories are spread out over the levels of their average annual temperatures.
If you look carefully at Figure 4-10, you'll see that the regression errors (the differences between the expected values on the regression line and the actual values of smoking rates on the right side of the chart) look pretty small compared to the overall variation in smoking rates (from the left side of the chart). Part of the variation in smoking goes into the regression line and part of the variability in smoking goes into error. As a result of this, the overall level of error in a regression model is always smaller than the overall level of error in the corresponding mean model. Errors from both kinds of model are directly compared in Figure 4-11 for the Canadian provincial smoking data. The table in Figure 4-11 shows the expected values and associated errors for each province for the mean model and for the regression model. The expected value in the mean model is always 20.3% (the mean). The expected value for each province in the regression model is calculated from the equation for the regression of smoking on temperature (Figure 4-9). As the table in Figure 4-11 shows, the errors in the regression model are usually smaller in size than the errors in the mean model. The difference is biggest for the provinces with the biggest errors. The largest error in the regression model is 5.8% (Yukon). Four different provinces (including Yukon) have errors larger than 5.8% in the mean model.

Figure 4-11. Comparison of model error in mean and regression models of smoking rates across the 13 Canadian provinces and territories, 2008

The standard deviation of the model error in the regression model is 3.1%. This is called regression error standard deviation. Regression error standard deviation is a measure of the amount of spread in the error in a regression model. Regression error standard deviation is based on the errors in the regression model and the degrees of freedom of the regression model. Regression error standard deviation for a given regression model is almost always smaller than the standard deviation from the corresponding mean model. In fact, the coefficients of regression models (slopes and intercepts) are selected in such a way as to produce the lowest possible regression error standard deviation. Standard deviations measure the spread of the error in a model. A higher standard deviation means more error. Figure 4-12 illustrates the spread of the error for the mean model and regression model for Canadian provincial smoking rates. The error figures plotted in Figure 4-12 are taken directly from the two error columns in Figure 4-11. Some of the provinces and territories with the largest errors are marked on the chart. In each model, the errors of most of the provinces and territories fall within one standard deviation of zero. The error standard deviation of the mean model is 5.3%, and 9 out of 13 provinces fall between +5.3% and -5.3%. All 13 provinces fall within two standard deviations (between +10.6% and -10.6%).

Figure 4-12. Illustration of standard deviation and regression error standard deviation for mean and regression models of smoking rates across the 13 Canadian provinces and territories, 2008

For the regression model, the error standard deviation is smaller, but still 9 out of 13 provinces fall within one standard deviation of their expected values, with errors ranging between +3.1% and -3.1%. Again, all 13 provinces have model errors that fall within two standard deviations (+6.2% to -6.2%). There is no rule that errors must fall within two standard deviations, but usually they do. Usually model results look something like Figure 4-12, with most expected values falling within one standard deviation of their observed values (error less than one standard deviation) and the vast majority of expected values falling within two standard deviations of their observed values (error less than two standard deviations). When a model has a small error standard deviation, that means that the model produces good, accurate estimates of the dependent variable.
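The "within one or two standard deviations" check can be sketched in Python. The 13 error values below are hypothetical numbers invented for illustration (they are not the provincial errors from Figure 4-11), and `share_within` is a hypothetical helper. The standard library function `statistics.stdev` divides by N-1, which matches the degrees of freedom of a mean model.

```python
from statistics import stdev

def share_within(errors, sd, k=1):
    """Share of cases whose error is no larger than k standard deviations."""
    inside = sum(1 for e in errors if abs(e) <= k * sd)
    return inside / len(errors)

# Hypothetical mean-model errors for 13 cases (NOT the data in Figure 4-11);
# like all deviations from a mean, they sum to zero.
errors = [-4, -3, -2, -1, -1, 0, 0, 1, 1, 2, 2, 2, 3]

sd = stdev(errors)  # uses N-1 degrees of freedom

share_one_sd = share_within(errors, sd, k=1)
share_two_sd = share_within(errors, sd, k=2)

print(round(sd, 2))    # about 2.12
print(share_one_sd)    # 10 of the 13 cases
print(share_two_sd)    # all 13 cases
```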

4.4. Calculating variance and standard deviation (optional/advanced)

There is rarely any need to calculate the variance and standard deviation of a variable or of the error in a mean model or linear regression model. Statistical computer programs, spreadsheet programs, and even calculators all are able to calculate standard deviation. On the other hand, unlike calculating regression coefficients, calculating standard deviation is not too difficult. There are six steps in the calculation of the standard deviation of a variable. They are:
(1) Calculate the mean of the variable
(2) Calculate deviations from the mean for each case of the variable
(3) Square these deviations
(4) Sum up all the squared deviations into total squared deviation
(5) Divide total squared deviation by the degrees of freedom to arrive at variance
(6) Take the square root of variance to arrive at standard deviation

These six steps in the calculation of standard deviation are illustrated in Figure 4-13 using data on the number of subway stations in each borough of New York City. Including the 22 stations of the Staten Island Railway as subway stations, there are a total of 490 stations in the five boroughs. Dividing 490 by 5 gives a mean number of subway stations per borough of 98 (Step 1). Each borough's deviation from this mean of 98 stations is given in the table (Step 2). To the right of the deviations are the deviations squared (Step 3). The sum total of these squared deviations is 14434 (Step 4). Since there are five boroughs, and the deviations in Figure 4-13 are deviations from a mean model (not a regression model), there are 4 degrees of freedom (5 - 1 = 4). Dividing the total squared deviation by the degrees of freedom (14434 / 4) gives the variance of the number of subway stations per borough, 3608.5 (Step 5).
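The six steps can be sketched as a short Python function. This is only an illustration with a hypothetical function name, not the routine used by any particular statistics program. Because the borough-by-borough station counts appear only in Figure 4-13, the subway calculation below starts from the totals quoted in the text, and the full six-step function is demonstrated on the two Korean populations from Section 4.2 instead.

```python
from math import sqrt

def standard_deviation(values):
    """Standard deviation of a variable, following the six steps in the text."""
    n = len(values)
    mean = sum(values) / n                       # Step 1: mean
    deviations = [v - mean for v in values]      # Step 2: deviations from the mean
    squared = [d ** 2 for d in deviations]       # Step 3: squared deviations
    total_squared = sum(squared)                 # Step 4: total squared deviation
    variance = total_squared / (n - 1)           # Step 5: divide by degrees of freedom
    return sqrt(variance)                        # Step 6: square root of variance

# Subway stations: total squared deviation and degrees of freedom from the text
subway_variance = 14434 / 4         # Step 5
subway_sd = sqrt(subway_variance)   # Step 6

# Korean populations in millions (Section 4.2), run through all six steps
korea_sd = standard_deviation([24, 48])

print(subway_variance)      # 3608.5
print(round(subway_sd, 1))  # 60.1
print(round(korea_sd, 2))   # 16.97
```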

Figure 4-13. Calculating the standard deviation of New York City subway stations per borough, 2010

Variance is sometimes used instead of standard deviation as a measure of the spread of a variable. The problem with variance is that it is not very intuitively meaningful. For example, the variance of the number of subway stations in Figure 4-13 is 3608.5 (the total squared deviation of 14434 divided by 4 degrees of freedom). Since variance is calculated from squared deviations, it is expressed in squared units. As a result, the variance in Figure 4-13 is really 3608.5 squared stations. Since there's no such thing as a squared station, it makes sense to take the square root of variance. Taking the square root of variance gives standard deviation. The standard deviation in Figure 4-13 represents a number of stations. The number of subway stations per borough of New York City has a mean of 98 stations and a standard deviation of 60.1 stations. Calculating regression error standard deviation works exactly the same way as calculating standard deviation, except that the degrees of freedom equal N-2 instead of N-1. This difference in the degrees of freedom is the reason why it is possible (though unlikely) for regression error standard deviation to be greater than the standard deviation from a mean model. Taken as a whole, the expected values from a regression model are always at least as close to the observed values of the dependent variable as the expected values from a mean model. This is because the expected values from a regression model vary, while the expected values from a mean model are constant (they're just the mean). Since the regression expected values are closer to the observed values of the dependent variable, their errors (deviations) are smaller, and their squared errors (deviations) are smaller. The degrees of freedom in the regression model, however, are smaller as well (N-2 instead of N-1). It is just possible that the smaller degrees of freedom can offset the smaller squared error to produce a larger variance.
As a rule, linear regression models always have less error standard deviation than mean models unless both the slope and the number of cases (N) are very small. When the slope is small, the expected values of the regression model are not very different from the expected values of the mean model: both are constant, or nearly so. When the number of cases is small, the difference in the degrees of freedom can be big enough to matter (the difference between 4 and 3 is much more important than the difference between 4000 and 3999). In practice, this (almost) never happens. Where data are available to calculate expected and predicted values using a regression model, these will (almost) always be better than expected or predicted values from a mean model. A mean model would only be used to make predictions where the data weren't available to use a linear regression model.
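The rare case described above can be reproduced with a small Python sketch. Both data sets are hypothetical values chosen for illustration, and the function names are assumptions. With only four cases and a slope of zero, the regression has the same total squared error as the mean model but divides it by N-2 instead of N-1, so its error standard deviation comes out larger. With a steep slope, the regression error standard deviation is much smaller.

```python
from math import sqrt

def mean_model_sd(y):
    """Standard deviation of the error in a mean model (N-1 degrees of freedom)."""
    n = len(y)
    y_bar = sum(y) / n
    return sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

def regression_error_sd(x, y):
    """Standard deviation of the error in a regression model (N-2 degrees of freedom)."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    errors = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return sqrt(sum(e ** 2 for e in errors) / (n - 2))

# Hypothetical data with very few cases
x = [1, 2, 3, 4]
flat_y = [3, 5, 2, 4]   # no trend: the fitted slope is zero
steep_y = [1, 3, 4, 6]  # strong upward trend

print(mean_model_sd(flat_y), regression_error_sd(x, flat_y))    # regression SD is larger
print(mean_model_sd(steep_y), regression_error_sd(x, steep_y))  # regression SD is smaller
```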

4.5. Case study: Income and wage employment in China

China has been experiencing extraordinarily rapid rates of economic growth since the late 1990s. Nonetheless, China as a whole is still a relatively poor country. Its average income levels are less than half those of Mexico. One characteristic of poor countries all over the world is that many people live off the land growing their own food instead of working for pay. As incomes rise, more and more people move off the land to seek employment in factories and other workplaces that pay money wages. In China today, millions of people are moving from small farming villages to new urban areas every year, making the transition from subsistence farming to wage labor. Social scientists debate whether people are better off as subsistence farmers or better off as wage laborers, but either way the trend is unmistakable. Millions of Chinese join the ranks of wage laborers every year.
Like Canada and Australia, China has more than one kind of administrative division. In China, there are 4 independent municipalities (the biggest cities in the country), 22 provinces, and 5 "autonomous regions" that have large minority populations and have different administrative procedures than regular provinces. There are also two "special administrative regions" (Hong Kong and Macau) that for historical reasons are not included in many Chinese data. In addition, China claims ownership of but does not control the island of Taiwan. All told, most Chinese datasets include variables for the 31 main divisions, excluding Hong Kong, Macau, and Taiwan. A database containing population, income, and employment data for these 31 divisions is reproduced as Figure 4-14.

Figure 4-14. Conditional means of labor force participation rates across Chinese cities, provinces, and regions, 2008

Two variables in Figure 4-14 are particularly interesting for understanding the transition from subsistence agriculture to wage labor. The variable INC$2008 is the mean income level for wage earners in each administrative division, and the variable EMP(%) is the labor force participation rate (the proportion of people in each division who are employed in formal wage labor). Conditional mean levels of income, conditional on the type of division, are plotted in Figure 4-15. The mean income level in each type of division (municipality, province, or region) is reported on the graph. The four municipalities are much richer than the provinces and regions, but there is one relatively poorer municipality, Chongqing, which is inland deep in the middle of China. There is one apparently rich region, Tibet, but in fact Tibet is relatively poor. The very high cost of living in Tibet keeps wages higher than they otherwise might be.

Figure 4-15. Conditional means of labor force participation rates across Chinese cities, provinces, and regions, 2008

Figure 4-16 contrasts two models of labor force participation for the 22 Chinese provinces. Figure 4-16 focuses on the provinces because there are more provinces than other divisions, and because municipalities and regions are different in many ways from provinces. The left side of Figure 4-16 presents a mean model with 21 degrees of freedom for labor force participation (marked LFP). The mean level of labor force participation across the 22 provinces is 54.4%, with a standard deviation of 6.9%. All provinces except Zhejiang have labor force participation rates that fall within two standard deviations of the mean. The right side of Figure 4-16 presents a linear regression model with 20 degrees of freedom that regresses labor force participation (dependent variable) on mean income level (independent variable). The parameters of this model are reported in Figure 4-17.
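The two-standard-deviation band for the mean model follows directly from the mean and standard deviation quoted above. A minimal sketch of the arithmetic is given here; the helper name `within_two_sd` is invented for illustration, and the individual provincial rates from Figure 4-14 are not reproduced.

```python
mean_lfp = 54.4   # mean labor force participation (%), from the text
sd_lfp = 6.9      # mean model standard deviation (%), from the text

# Two standard deviations on either side of the mean
lower = mean_lfp - 2 * sd_lfp
upper = mean_lfp + 2 * sd_lfp
print(f"two-standard-deviation range: {lower:.1f}% to {upper:.1f}%")

def within_two_sd(rate):
    """True if a participation rate lies within two standard deviations of the mean."""
    return lower <= rate <= upper
```

Any province whose participation rate falls outside this range would count as lying more than two standard deviations from the mean.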

Figure 4-16. Mean and regression models of labor force participation rates across the 22 Chinese provinces, 2008
Figure 4-17. Regression of labor force participation on provincial mean income for 22 Chinese provinces, 2008

The regression model slope of 6.3 implies that for every $1000 rise in mean income, the expected value of the labor force participation rate rises by 6.3 percentage points. The regression error standard deviation of this model is 3.4%, which is less than half the mean model standard deviation of 6.9%. The strong positive slope and the low level of error in the regression model suggest that the regression model provides a much better representation of labor force participation than the mean model. Labor force participation in Chinese provinces does seem to rise at least in part in line with rising wage incomes.
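The interpretation of the slope and the comparison of the two error levels can be checked with a short sketch. It uses only the three figures quoted in the paragraph above. The intercept reported in Figure 4-17 is not reproduced here, so the sketch computes changes in the expected value rather than expected values themselves. The function name `expected_lfp_change` is invented for illustration.

```python
SLOPE_PER_1000 = 6.3        # rise in expected participation per $1000 of mean income (from the text)
MEAN_MODEL_SD = 6.9         # mean model standard deviation (from the text)
REGRESSION_ERROR_SD = 3.4   # regression error standard deviation (from the text)

def expected_lfp_change(income_change_dollars):
    """Change in the expected participation rate for a given change in mean income."""
    return SLOPE_PER_1000 * income_change_dollars / 1000

# A hypothetical $500 difference in mean income between two provinces
print(round(expected_lfp_change(500), 2))

# Regression error as a share of mean model error
error_ratio = REGRESSION_ERROR_SD / MEAN_MODEL_SD
print(round(error_ratio, 2))
```

The ratio comes out just under one half, which is what "less than half" refers to in the text.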

Chapter 4 Key Terms

  • Conditional means are the expected values of dependent variables for specific groups of cases.
  • Degrees of freedom are the number of errors in a model that are actually free to vary.
  • Mean models are very simple statistical models in which a variable has just one expected value, its mean.
  • Means are the expected values of variables.
  • Parameters are the figures associated with statistical models, like means and regression coefficients.
  • Regression error standard deviation is a measure of the amount of spread in the error in a regression model.
  • Standard deviation is a measure of the amount of spread in a variable, which is the same thing as the amount of spread in the error in a mean model.
