Social Statistics, Chapter 8: Standardized Coefficients

Standardized Coefficients

We all know that when we drive, our cars pollute the atmosphere. We can literally see, feel, hear, and smell the pollution coming out of our exhaust pipes. Of course, cars aren't the only source of air pollution. Almost everything we do in modern society causes air pollution, and in particular the emission of carbon dioxide (CO2) into the atmosphere, which causes global warming. Shopping causes CO2 emissions because of all the energy used to get products into the stores in the first place. Sending an e-mail causes CO2 emissions because of the electricity used by our computers. Eating causes CO2 emissions because of the electricity and gas used by our kitchen appliances. Even sleeping causes CO2 emissions if we use air conditioning or heating in our bedrooms, listen to music, or set an alarm clock. There's no way around it: simply living in modern society pollutes the air and contributes to global warming.

In Chapter 3, countries' levels of CO2 emissions were regressed on their numbers of passenger cars in a simple linear regression model. Given that CO2 emissions come from multiple sources, not just passenger cars, a multicausal model seems more appropriate. Ideally, a multicausal model of CO2 emissions would include all sorts of variables: electricity usage; air, rail, and truck transportation; agricultural emissions; and other sources of pollution. Including all of these variables in a multiple linear regression model might predict CO2 emissions very accurately, since all the contributors to CO2 emissions would be accounted for.

A much simpler multicausal model of CO2 emissions is presented in Figure 8-1. Using data from Figure 3-1, tons of CO2 emissions per person in 51 countries are regressed on three predictors: passenger cars per 1000 people, national income per person, and western European status. Theoretically, greater numbers of passenger cars and higher national income per person should be associated with higher levels of CO2 emissions. Western European status should be associated with lower levels of CO2 emissions, since western European countries have been very active in promoting action on climate change in recent years. Three models are presented in Figure 8-1: a simple linear regression model based on passenger cars (Model 1), a multiple linear regression model that adds national income (Model 2), and a further multiple linear regression model that also adds western European status (Model 3).

Figure 8-1. Regression of carbon dioxide (CO2) emissions on passenger cars and national income, 2005 (data from Figure 3-1)

From the perspective of passenger cars as an independent variable of interest, national income is a control variable. The slope associated with passenger cars is 0.013 in Model 1, but after controlling for national income the slope declines to 0.008. This decline means that national income is a competing control from the perspective of passenger cars: it competes with passenger cars in explaining the same facts about CO2 emissions. This is not surprising. After all, driving a car is an integral part of life in a rich country. When countries have higher levels of national income per person, they have more cars. As a result, the two variables compete in explaining CO2 emissions. For national income to complement passenger cars, it would have to explain something else about a country that might otherwise obscure the true relationship between passenger cars and CO2 emissions.

Something that does complement both passenger cars and national income is western European status. Western European status helps bring out the true relationships between passenger cars and CO2 emissions and between national income and CO2 emissions because western European countries have relatively low emissions for their passenger car and national income levels. In western Europe, passenger cars tend to be small and fuel-efficient, so western European countries have lower levels of CO2 emissions for the same number of cars. Similarly, western Europe relies heavily on nuclear and wind power, resulting in lower CO2 emissions for the same level of national income. As a result, western European status is a complementary control from the perspective of both passenger cars and national income. It complements national income especially strongly.

Which variable has a stronger overall effect on CO2 emissions, the number of passenger cars or the level of national income? It's hard to say, because the variables are expressed in very different units. An increase of one car per 1000 people has a smaller effect than an increase of $1000 in national income, but this is not obviously a fair comparison. After all, an increase of 100 cars per 1000 people would have a much bigger effect than an increase of $1000 in national income. We could compare the t statistics for passenger cars and national income to determine which one is more statistically significant, but we don't yet have a mechanism for comparing the absolute sizes of the effects of different variables on the same scale. Without one, we can't determine which variable has a bigger impact on CO2 emissions.

Going one step further, how much of the international variability in CO2 emissions is explained by these two variables taken together, or by these two variables plus western European status? After all, the ultimate goal of statistical modeling is to explain things about the world. The models presented in Figure 8-1 partially explain country differences in CO2 emissions, but we don't know how much of the variability in CO2 emissions they actually explain. This would be a very nice thing to know, certainly from a scientific standpoint but also from the standpoint of social policy. Politicians and the public have a legitimate desire to know just how much of the wide variability observed in the real world can be explained by our statistical models. If we can explain 90% of the observed variability, great. If we can explain 50%, alright, at least that's something. If we can only explain 10% of the observed variability and the rest of reality is fundamentally random, the world has a legitimate right to ignore us and our statistical results.
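Models like those in Figure 8-1 can be estimated with any statistical software. As a minimal sketch in Python with statsmodels, assuming a hypothetical data file co2_countries.csv with columns co2_tons, cars_per_1000, income_1000s, and west_europe (coded 0/1) drawn from the Figure 3-1 data:

```python
import pandas as pd
import statsmodels.formula.api as smf

countries = pd.read_csv("co2_countries.csv")  # hypothetical file name

# Model 1: simple regression of CO2 emissions on passenger cars
m1 = smf.ols("co2_tons ~ cars_per_1000", data=countries).fit()
# Model 2: adds national income (in thousands of dollars)
m2 = smf.ols("co2_tons ~ cars_per_1000 + income_1000s", data=countries).fit()
# Model 3: adds western European status
m3 = smf.ols("co2_tons ~ cars_per_1000 + income_1000s + west_europe",
             data=countries).fit()

for m in (m1, m2, m3):
    print(m.params)  # watch the passenger-car slope change as controls enter
```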

This chapter shows how the relative explanatory power of different variables can be directly compared in the context of multiple regression models. First, in order to compare variables, they have to be put on the same scale through a process called standardization (Section 8.1). Regression slopes estimated using standardized variables can be used to directly compare the relative explanatory power of each variable. Second, when standardized variables are used in a simple linear regression, the result is a correlation coefficient that measures the strength of the relationship between the variables (Section 8.2). The correlation conveniently measures the strength of the relationship between two variables on a scale from -1 to +1. Correlation coefficients can also be used to evaluate the overall predictive power of any regression model (Section 8.3), measuring the total variability in the dependent variable that is explained by the independent variables. An optional section (Section 8.4) illustrates how correlation coefficients can be used to explore the structure of the relationships connecting a large number of variables at the same time. Finally, the chapter ends with an applied case study of people's satisfaction with democracy in Taiwan (Section 8.5). This case study illustrates how standardized regression coefficients are used to evaluate and compare the performance of different variables in multiple regression models, and how the relative explanatory power of different regression models can be compared. All of this chapter's key concepts are used in this case study. By the end of this chapter, you should be able to use multiple regression to make basic inferences about the effects of multiple independent variables on a single dependent variable.

8.1. Comparing like with like using standardized variables

Linear regression involves comparing the values of variables that mean very different things. In Model 2 of Figure 8-1, every 1 additional passenger car per 1000 people in a country is associated with an increase of 0.008 tons of CO2 emissions per person, while every additional $1000 in national income is associated with an increase of 0.074 tons. Passenger cars are expressed as a number per 1000 people, national income is expressed in thousands of dollars, and CO2 emissions are expressed in tons per person per year. The use of so many different units makes it hard to compare countries across different variables. For example, Australia has 542 cars per 1000 people, a national income of $29,480 per person, and CO2 emissions of 18.1 tons per person. Are these figures high, low, or about average? How do they compare to each other, and to other countries?

One way to judge whether a value of a variable is high or low is by using means and standard deviations. Figure 8-2 reports the means and standard deviations of CO2 emissions, passenger cars, and national income for the 51 countries used in Figure 8-1. Australia's CO2 emissions (18.1 tons) are well above the mean level across the 51 countries. In fact, they are more than 2 standard deviations above the mean. The mean is 6.15 tons. The mean plus one standard deviation is 6.15 + 4.34 = 10.49 tons. The mean plus two standard deviations is 6.15 + 4.34 + 4.34 = 14.83 tons. Australia, with emissions of 18.1 tons, is far above this level.

Figure 8-2. Means and standard deviations of carbon dioxide (CO2) emissions, passenger cars, and national income, 2005 (data from Figure 3-1)

Australia also has a large number of cars per 1000 people: 542 cars versus a cross-national mean of 242. Australia's level is more than 1 standard deviation above the mean, but not quite 2 standard deviations. Australia's national income of $29,480 is also well above the mean, but by a little less than 1 standard deviation; it is actually just 0.83 standard deviations above the mean. So while Australia is around 1 standard deviation above the mean on passenger cars and national income, it is more than 2 standard deviations above the mean on CO2 emissions. This suggests that CO2 emissions are especially high in Australia. Looking back at Figure 3-2, this seems to be the case: Australia's observed CO2 emissions are much higher than the expected levels indicated by the regression line. Australia is a clear outlier. Mexico, on the other hand, is 0.47 standard deviations below the mean on CO2 emissions, 0.51 standard deviations below the mean on passenger cars, and 0.42 standard deviations below the mean on income. It is below the mean by roughly the same amount on all three variables. It is also much closer to the mean on all three variables than Australia is, at least in terms of standard deviations. For example, where Mexico is 0.47 standard deviations below the mean on CO2 emissions, Australia is 2.75 standard deviations above it.

Discussing variables in terms of standard deviations above or below the mean is a very useful technique. Instead of computing standard deviations above or below the mean for particular cases, it makes more sense to convert all the values of a variable into standard deviations above or below the mean at the same time, in a process called "standardization." Standardized variables are variables that have been transformed by subtracting the mean from every observed value and then dividing by the standard deviation. Because of the way they are constructed, standardized variables always have a mean of 0 and a standard deviation of 1. Also because of the way they are constructed, standardized variables have no units. For example, Mexico is 0.42 standard deviations below the mean on national income. It is not 0.42 dollars below the mean or 0.42 percent below the mean. It is just 0.42 standard deviations below the mean.

The lack of units is the one downside of standardized variables. When social statisticians want to do something that requires the original units, like using a regression model to make predictions, they need unstandardized variables. Unstandardized variables are variables that are expressed in their original units. All variables start out as unstandardized variables, expressed in dollars, pounds, percentages, cars per 1000 people, or some other unit. In general, "unstandardized variables" are just "variables." The term "unstandardized variable" is only used when it is necessary to distinguish an unstandardized variable from a standardized variable.

The most important use of standardized variables is in linear regression models. In Figure 7-13, smoking rates in the 13 Canadian provinces and territories were regressed on average daily temperatures and rates of heavy alcohol drinking. Both temperature and heavy drinking were statistically significant predictors of smoking rates, but since they are recorded in different units (degrees versus percents) their effects were difficult to compare. Standardizing all three variables (smoking, temperature, and drinking) makes a meaningful comparison of the effects of temperature and drinking on smoking rates possible. The first step in standardizing variables is to find their means and standard deviations. Means and standard deviations for all three variables are reported in Figure 8-3. The second step is to use these means and standard deviations to convert each value of each variable into a standardized value, as in the formula below.
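In symbols, if $\bar{x}$ is the mean of a variable and $s$ is its standard deviation, each observed value $x$ becomes the standardized value

$$ z = \frac{x - \bar{x}}{s} $$

Applied to Australia's CO2 emissions, using the figures from Figure 8-2:

$$ z = \frac{18.1 - 6.15}{4.34} \approx 2.75 $$

which is the 2.75 standard deviations above the mean noted earlier.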

Figure 8-3. Means and standard deviations for smoking rates, average temperature, and heavy drinking in 13 Canadian provinces and territories, 2008 (data from Figure 4-8)

Normally, standardization is performed automatically by statistical computer software, but the steps in the process of standardization are illustrated in Figure 8-4 for the variable "heavy drinking." All 13 Canadian provinces and territories are listed in Column 1 and their heavy drinking rates are reported in Column 2. The mean value of heavy drinking is recorded in Column 3. The difference between each value and the mean is calculated in Column 4. The standard deviation of heavy drinking is recorded in Column 5. The standardized value of heavy drinking for each province is calculated in Column 6. The values in Column 6 represent how many standard deviations each province is from the mean level of heavy drinking for all provinces.

Figure 8-4. Illustration of how variables are standardized, using the example of heavy drinking in 13 Canadian provinces and territories, 2008
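In practice, the column-by-column arithmetic of Figure 8-4 is a one-liner in statistical software. A minimal sketch in Python with pandas, assuming a hypothetical file canada_smoking.csv with a heavy_drinking column (the actual figures come from Figure 4-8):

```python
import pandas as pd

provinces = pd.read_csv("canada_smoking.csv")  # hypothetical file name

x = provinces["heavy_drinking"]                # assumed column name
deviation = x - x.mean()                       # Column 4: value minus the mean
provinces["heavy_drinking_z"] = deviation / x.std()  # Column 6: standardized value

# Standardized variables always have mean 0 and standard deviation 1
print(provinces["heavy_drinking_z"].mean(), provinces["heavy_drinking_z"].std())
```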

Standardization changes the value of each case in a variable into a standardized value, but it does not change the distribution of the variable in any way. High cases are still high cases and low cases are still low cases. It's only the units that change. This is illustrated in Figure 8-5. The left side of Figure 8-5 is a scatter plot of smoking rates versus heavy drinking rates for the 13 Canadian provinces and territories using the original, unstandardized variables. The right side of Figure 8-5 depicts the same relationship using standardized variables. The layout of the points is identical; only the scales of the axes change. The standardized plot is centered on 0, with each axis running from -3 to +3. Notice also how the regression line depicted on the standardized plot runs right through the center point. This happens with any regression estimated using standardized variables.

Figure 8-5. Comparison of scatter plots of smoking rates versus heavy drinking rates for unstandardized (left) versus standardized (right) variables across 13 Canadian provinces and territories

The coefficients of a multiple linear regression model regressing the provincial smoking rates on average temperatures and heavy drinking rates are reported in Figure 8-6. Two sets of coefficients are reported, one set using the unstandardized variables and one using the standardized variables. Unstandardized coefficients are the coefficients of regression models that have been estimated using original unstandardized variables. Standardized coefficients are the coefficients of regression models that have been estimated using standardized variables. The unstandardized coefficients are just the regular coefficients found using the three variables in their original units. They are the same as the coefficients that were reported in Figure 7-13, Model 2. The standardized coefficients, on the other hand, are the coefficients of the model that result when the new, standardized versions of the variables are used.

Figure 8-6. Regression of smoking rates on temperatures and heavy drinking rates across 13 Canadian provinces and territories, 2008 (after Figure 7-13, Model 2)

Since the standardized coefficients are both expressed in standardized units (standard deviations, not degrees or percentages), they are directly comparable. The relationship between temperature and smoking is more than twice as strong as the relationship between heavy drinking and smoking: every one standard deviation increase in temperature is associated with a 0.832 standard deviation decline in the smoking rate, while every one standard deviation increase in heavy drinking is associated with a 0.398 standard deviation increase in the smoking rate.

No intercept is reported for the standardized model because the intercept in a standardized model is always 0. It would not be incorrect to report 0 as the intercept, but it is customary to leave the intercept blank for standardized models. Because of this custom, it is easy to tell at a glance whether a model uses unstandardized or standardized variables: whenever an intercept is reported, the model is unstandardized; when no intercept is reported, the model must be standardized. The t statistics and probability levels are the same for the unstandardized and standardized coefficients. Since standardization doesn't change the layout of the points in an analysis or the amount of error, it has no effect on significance levels.

The standardized coefficients for the regression models of CO2 emissions are reported in Figure 8-7. In Model 2, the standardized coefficient for passenger cars is larger than the standardized coefficient for national income, indicating that the number of passenger cars predicts CO2 emissions more strongly than the level of national income does. The difference between the two standardized coefficients is small but clear. Controlling for western European status in Model 3 changes the situation: in Model 3, the standardized coefficient for national income is stronger than the coefficient for passenger cars. This happens because controlling for western European status removes a lot of the error from the relationship between national income and CO2 emissions (western European status strongly complements national income).
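Standardized coefficients like those in Figures 8-6 and 8-7 can be reproduced by z-scoring every variable before estimating the model. A minimal sketch for the smoking model, reusing the hypothetical file and column names from the earlier sketch:

```python
import pandas as pd
import statsmodels.formula.api as smf

provinces = pd.read_csv("canada_smoking.csv")  # hypothetical file, as above

cols = ["smoking", "temperature", "heavy_drinking"]  # assumed column names
z = provinces[cols].apply(lambda col: (col - col.mean()) / col.std())

model = smf.ols("smoking ~ temperature + heavy_drinking", data=z).fit()
print(model.params)   # slopes are the standardized coefficients; intercept is ~0
print(model.tvalues)  # t statistics match those of the unstandardized model
```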

Figure 8-7. Comparison of metric and standardized coefficients for the regression of carbon dioxide (CO2) emissions on passenger cars and national income, 2005 (after Figure 8-1)

So which variable most strongly predicts CO2 emissions? The answer is that their effects are about equal. Which one is stronger depends on whether or not we control for western European status. Western European status itself has a much weaker effect on CO2 emissions, a little more than half as strong as either of the other variables. The use of standardized variables to produce standardized coefficients makes all these comparisons possible.

8.2. Correlation

In Figure 8-6 and Figure 8-7, standardized coefficients are used to compare the relative strengths of relationships within a model. The relationship between temperature and the smoking rate is much stronger than the relationship between heavy drinking and the smoking rate across Canadian provinces. The relationship between passenger cars and CO2 emissions is roughly the same strength as the relationship between national income and CO2 emissions across countries. In each of these cases, the strengths of different relationships are compared within a single regression model.

Standardized coefficients can also be used to compare the strengths of relationships across models. For example, in Chapter 1 it was theorized that low incomes led to higher junk food consumption. This theory was operationalized into two specific hypotheses. First, it was hypothesized that US states with higher median incomes would have less soda consumption. Second, it was hypothesized that states with higher incomes would have less sweetened snack consumption. It turned out that the first hypothesis was correct, but the second was wrong. Contrary to expectations, states with higher incomes actually had more sweetened snack consumption, not less. We might want to know which relationship was stronger: the (expected) relationship between income and soda or the (unexpected) relationship between income and sweetened snacks.

The results of regressions of each dependent variable on state median income are reported in Figure 8-8 and Figure 8-9. The unstandardized coefficients in each model are the same coefficients as were reported in Chapter 1. They indicate that each additional $1000 in state median income is associated with a 0.603 gallon decline in soda consumption per person per year and a 0.611 pound increase in sweetened snack consumption per person per year. The standardized coefficients indicate that the relationship between income and soda consumption is actually slightly stronger than the relationship between income and sweetened snack consumption.

Figure 8-8. Comparison of metric and standardized coefficients for regression of soda consumption on state median income for 48 US states, 2008 (data from Figure 1-2)
Figure 8-9. Comparison of metric and standardized coefficients for regression of sweetened snack consumption on state median income for 48 US states, 2008 (data from Figure 1-2)

Standardized coefficients from simple linear regression models are often used in this way to gauge the strengths of the relationships between variables. Standardized coefficients are a convenient shorthand for the relationship between two variables for several reasons. First, since they are based on standardized variables, they are always comparable, no matter what units the original variables were measured in. Second, in simple linear regression models with one predictor, it doesn't matter which variable is the dependent variable and which is the independent variable; either way the standardized coefficient will be the same. Third, again only in simple linear regression models with one predictor, the standardized coefficient will always fall somewhere between -1 and +1. The slope is never larger than 1 in either direction.

This use of standardized coefficients from simple regression models is so common that it has its own name and symbol: correlation, denoted using the symbol "r." Correlation (r) is a measure of the strength of the relationship between two variables that runs from r = -1 (perfect negative correlation) through r = 0 (no correlation) to r = +1 (perfect positive correlation). The correlation between two variables is exactly the same thing as the standardized coefficient from a regression of one of the variables on the other. Figure 8-10 demonstrates how the correlations of state income with soda and sweetened snacks might be compared. Notice how the correlation coefficients are identical to the simple linear regression coefficients reported in Figure 8-8 and Figure 8-9. Similarly, their probabilities correspond to those of the t statistics reported in Figure 8-8 and Figure 8-9.
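These three properties are easy to verify numerically. A quick sketch using synthetic data (illustration only, not the state consumption figures):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)     # synthetic data for illustration only

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)

zx, zy = standardize(x), standardize(y)
slope_yx = np.polyfit(zx, zy, 1)[0]  # standardized slope, y regressed on x
slope_xy = np.polyfit(zy, zx, 1)[0]  # standardized slope, x regressed on y
r = np.corrcoef(x, y)[0, 1]          # the correlation coefficient

print(slope_yx, slope_xy, r)         # all three agree, and |r| <= 1
```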

Figure 8-10. Correlation of state median income with soda and sweetened snack consumption for 48 US states, 2008

8.3. R and R2

In addition to gauging the strengths of the relationships between variables in general, correlations have another, very specific use: they can be used to evaluate the overall predictive strength of any regression model. For example, the table in Figure 8-11 repeats the results of the regression of the gender gap in science scores on three independent variables (national income, educational spending, and number of teachers) from Figure 7-6. In Figure 8-11 both unstandardized and standardized coefficients are reported. The standardized coefficients indicate that the effects of teachers and national income are stronger than the effect of educational spending, but they don't tell us how well the model as a whole does in predicting the gender gap in science.

Figure 8-11. Multiple linear regression of the gender gap in science scores on various independent variables, 2006 (after Figure 7-6; data from Figure 7-1)

Correlations can help shed light on this. The key question in evaluating the performance of a model is: how well do the values of the dependent variable predicted by the model correspond to the actually observed values of the dependent variable? In other words, what is the strength of the relationship between the dependent variable's expected values and its actual values? A correlation is ideal for measuring the strength of the relationship between two variables. When the two variables being correlated are the actual and expected values of a dependent variable from a regression model, the correlation is represented by the capital letter R (to distinguish it as a specific type of correlation). For Model 1 of Figure 8-11, R is 0.446: the actual values of the gender gap in science are correlated r = 0.446 with the expected values generated by the regression model. This is not ideal (a perfect correlation would be r = 1), but it's something. Figure 8-12 plots the actual values of the gender gap in science scores against the expected values generated by Model 1 of Figure 8-11. There is a definite positive relationship between the two, but it is not very strong. By way of comparison, for the regression predicting smoking rates across Canadian provinces (Figure 8-6) the model's predictive performance was R = 0.928, while for the regression predicting CO2 emissions across countries (Model 3 of Figure 8-7) it was R = 0.632. The regression model in Figure 8-11 does predict the gender gap in science scores, but not especially well.
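R can be computed directly from any fitted model by correlating the observed and predicted values. A sketch continuing the hypothetical statsmodels examples above, where model is any fitted OLS result:

```python
import numpy as np

# 'model' is any fitted statsmodels OLS result, e.g. from the sketches above
actual = model.model.endog        # observed values of the dependent variable
expected = model.fittedvalues     # expected values predicted by the model

R = np.corrcoef(actual, expected)[0, 1]
print(R)  # Figure 8-11, Model 1 reports R = 0.446 for the science gap model
```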

Figure 8-12. Scatter plot of actual versus expected values of the gender gap in science scores based on Model 1 of Figure 8-11

In addition to summarizing the predictive strength of a regression model, the R statistic has one more very important use: it links regression models back to mean models. In Chapter 4, Figure 4-10 showed how the mean model for smoking rates across Canadian provinces could be mapped into the regression model for smoking. Each province's deviation from the mean smoking rate (left side of Figure 4-10) was spread out over the range of average temperatures in the regression model for smoking (right side of Figure 4-10). In Figure 8-13, the same kind of mapping is done for the relationship between smoking rates and heavy drinking rates. Again, each province's deviation from the mean smoking rate (left side) is spread out over the range of the independent variable, which in this case is heavy drinking (right side).

Figure 8-13. Illustration of mean and regression models of smoking rates across the 13 Canadian provinces and territories, 2008

Viewed in this way, a regression model is nothing more than an explanation of why certain cases deviate from the mean more than other cases do. In this view, Ontario has a low smoking rate at least in part because it has a low drinking rate, while the Northwest Territories have a high smoking rate at least in part because they have a high drinking rate. These provinces' smoking rates are also explained in part by their temperatures (Figure 4-10). Together, temperatures and heavy drinking rates explain a large portion of the total cross-province variability in smoking rates, but how much?

Answering this question properly requires a lot of algebra, but the end result of all that algebra is that the proportion of the total variability in the dependent variable that is explained by the independent variables in a regression model is equal to R x R, or R2 ("R squared"). R2 is a measure of the proportion of the total variability in the dependent variable that is explained by a regression model. As with other regression-related statistics, there is no need to calculate R2 by hand. Any software program that estimates regression models will automatically report the R2 statistic for the model.

The R2 for the model predicting the gender gap in science scores in Figure 8-11 is 0.199 or 19.9%, indicating that together national income, educational spending, and the number of teachers in a country explain almost 20% of the international variability in the gender gap in science. The R2 for the model predicting Canadian provincial smoking rates (Figure 8-6) is 0.861, indicating that over 86% of the variation across provinces in smoking rates can be explained by differences in temperatures and heavy drinking. The R2 for the full model predicting CO2 emissions across countries (Model 3 of Figure 8-7) is 0.399, indicating that nearly 40% of international differences in CO2 emissions are explained by differences in national income, passenger cars, and western European status.

Since R2 has such an important intuitive meaning (proportion of variability explained), most tables of regression results report R2, not R. Since by definition R2 equals R squared, it is easy to calculate one from the other (if necessary) using a calculator. In practice, R is rarely used, but R2 is reported and discussed for almost every regression model.
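Software reports R2 automatically (model.rsquared in statsmodels, for example), but it can also be computed from its definition as the proportion of variability explained. A sketch under the same assumptions as the previous example:

```python
import numpy as np

actual = model.model.endog        # as in the previous sketch
expected = model.fittedvalues

# Total variability: squared deviations of the actual values from their mean
ss_total = np.sum((actual - actual.mean()) ** 2)
# Unexplained variability: squared deviations of actual from expected values
ss_error = np.sum((actual - expected) ** 2)

r_squared = 1 - ss_error / ss_total
print(r_squared)                  # equals R * R, and matches model.rsquared
```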

8.4. Correlation matrices (optional/advanced)

Correlations are summary measures of the strengths of the relationships between variables. As such, they are useful even outside the context of regression models. Sometimes we just want to know what relationships exist among a group of variables. A table of all the correlations connecting a group of variables is called a correlation matrix. For example, the correlation matrix for the four variables included in the regression model for CO2 emissions (Figure 8-7) is reported in Figure 8-14. The dependent variable from the regression model is listed first in Figure 8-14, followed by all the independent variables, but the order of variables in a correlation matrix doesn't have any effect on the correlations. The table, or matrix, is just a convenient way to organize all the correlations among all the variables.

Figure 8-14. Correlation matrix of variables from Figure 8-7 (CO2 emissions regression)

Correlation matrices like the one depicted in Figure 8-14 always have the same number of rows as columns, because every variable is listed both as a row and as a column. Where a variable is matched with itself in the table, the correlation is always 1 (every variable is perfectly correlated with itself). As a result, correlation matrices always have a diagonal of 1's running through the middle. Another feature of correlation matrices is that the correlations reported in the upper right corner are the mirror image of the correlations reported in the lower left corner. This happens because correlations are symmetrical: the correlation of income with cars is the same as the correlation of cars with income. Since the entries in the upper right corner repeat the entries in the lower left corner, the upper right corner is sometimes left blank in correlation matrices. These half-blank matrices with the redundant correlations removed are known as triangular matrices.

A triangular correlation matrix with seven variables is depicted in Figure 8-15. These correlations are based on the characteristics of 1145 Taiwanese respondents to the 2006 World Values Survey (to be discussed in Section 8.5). Due to the long names of some of the variables, each variable is listed on a numbered row, and the numbers are used as headings for the columns. Means and standard deviations for all of the variables have also been included for easy reference. Correlation matrices with means and standard deviations summarize all of the important features of a group of variables in a compact space. Though many authors do use triangular matrices like the one depicted in Figure 8-15, full (square) matrices are much more convenient. It can take a few minutes to find the correlation you're looking for in a triangular matrix, but in a square matrix you can always read across a row to find any correlation you want.
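Producing a correlation matrix in software is straightforward. A sketch with pandas, reusing the hypothetical CO2 file and column names from the start of the chapter:

```python
import pandas as pd

countries = pd.read_csv("co2_countries.csv")  # hypothetical file, as above

cols = ["co2_tons", "cars_per_1000", "income_1000s", "west_europe"]
print(countries[cols].corr().round(3))  # square matrix with 1's on the diagonal

# Means and standard deviations are often reported alongside the matrix
print(countries[cols].agg(["mean", "std"]).round(2))
```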

Figure 8-15. Correlation matrix including descriptive statistics for variables included in the Taiwan democracy regression (Figure 8-16)

Correlation matrices can be used to pick out which independent variables are likely to be significantly related to the dependent variable. For example, in Figure 8-15 the variables that are most closely related to the democracy rating are age, education, and confidence in institutions. Correlation matrices can also be used to pick out which independent variables are likely to compete with or complement each other, based on which independent variables are correlated with each other. Highly correlated groups of variables, like age, education, and income, or trust and confidence, are likely to compete with or complement each other. Variables that are not correlated with each other are unlikely to be either competing or complementary when used in a regression model. Differentiating between competing and complementary controls based on the correlation matrix alone, however, is difficult or impossible.

8.5. Case study: Satisfaction with Taiwan's democracy

In Chapter 5, data from the Taiwan edition of the 2006 World Values Survey (WVS) were used to study the relationship between people's ages and their ratings of the quality of democracy in Taiwan. People's rating of democracy in Taiwan was scored on a scale from 0 to 100, where:

  • Rating = 0 means the respondent thinks there is not enough democracy in Taiwan
  • Rating = 50 means the respondent thinks there is just the right amount of democracy in Taiwan
  • Rating = 100 means the respondent thinks there is too much democracy in Taiwan

In a regression model reported in Figure 5-7, age was found to be positively related to people's rating of Taiwan's democracy, with each additional year of age predicting a 0.105 point increase in the democracy rating (older people thought Taiwan was more democratic than younger people). The mean democracy rating among the 1216 people studied was 38.7, indicating that most people were less than satisfied with the amount of democracy in Taiwan.

The results of two more extensive multiple regression models for the democracy rating in Taiwan are reported in Figure 8-16. These models are based on only 1145 survey responses because some people didn't answer all of the questions used in the models. The models reported in Figure 8-16 include six independent variables:

  • Age -- the respondent's age in years
  • Gender -- the respondent's gender, coded as Female=0 and Male=1
  • Education -- the respondent's years of education
  • Income -- the respondent's income decile (lowest tenth, second tenth, third tenth, etc.)
  • Trust in society -- a variable measuring the respondent's level of trust in society on a scale from 0 to 18
  • Confidence in institutions -- a variable measuring the respondent's level of confidence in the institutions of society (like government, corporations, and churches) on a scale from 0 to 45

Education and income are expected to be positively related to people's ratings of democracy, since people who have been more successful at advancing in society usually have higher opinions of that society. Similarly, those who have a higher level of trust and confidence in their society's institutions would be expected to rate their society's democracy more highly. Gender is also included as a control variable, but with no particular expectation about its effects.

Figure 8-16. Regression of citizens' democracy ratings in Taiwan on six independent variables, 2006 (N=1145)

The first column of results in Figure 8-16 reports the correlations of the democracy rating with each of the six independent variables. Only three of these correlations are statistically significant: those for age, education, and confidence in institutions. Figure 8-16 also reports the results of two regression models. For each model, both unstandardized and standardized coefficients are reported. The unstandardized coefficients are the ordinary regression coefficients based on the unstandardized versions of the variables in the model. The standardized coefficients are the coefficients that result from the same regression model estimated using standardized variables.

The standardized coefficients in Model 1 indicate that age has the most powerful effect on the democracy rating of any of the four variables included in the model. In Model 2, confidence in the institutions of society (including government) is positively associated with people's ratings of their country's democracy. This makes sense, since people who have no confidence in government would be unlikely to rate their government as being very democratic. Surprisingly, trust in society has no significant effect on people's ratings of how democratic their government is in Taiwan.

The R2 statistics for both models are shockingly low. The R2 of 0.019 for Model 1 indicates that the variables in Model 1 altogether explain only 1.9% of the total variability in people's ratings of Taiwan's democracy. The R2 rises as variables are added in Model 2, but at 0.026 (2.6%) it is still very low. These low R2 scores call into question the substantive significance of the two models. Both models have statistically significant coefficients, but a model that explains less than 3% of the total variability in the dependent variable may not be very useful from a policy standpoint. The real reasons behind people's ratings of the level of democracy of their own government remain a mystery, at least in Taiwan.
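A useful cross-check on tables like Figure 8-16 is the standard identity relating the two kinds of coefficients: a standardized coefficient equals the unstandardized slope multiplied by the ratio of the independent variable's standard deviation to the dependent variable's standard deviation. A small sketch of this identity; the standard deviations plugged in below are hypothetical, not the actual Figure 8-15 values:

```python
def standardized_coefficient(b, sd_x, sd_y):
    """Convert an unstandardized slope b to a standardized coefficient.

    b    -- slope in original units
    sd_x -- standard deviation of the independent variable
    sd_y -- standard deviation of the dependent variable
    """
    return b * sd_x / sd_y

# Hypothetical illustration: a slope of 0.105 rating points per year of age,
# with assumed standard deviations for age and the democracy rating
print(standardized_coefficient(b=0.105, sd_x=16.0, sd_y=25.0))  # ~0.067
```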

Chapter 8 Key Terms

  • Correlation (r) is a measure of the strength of the relationship between two variables that runs from r = −1 (perfect negative correlation) through r = 0 (no correlation) to r = +1 (perfect positive correlation).
  • R2 is a measure of the proportion of the total variability in the dependent variable that is explained by a regression model.
  • Standardized coefficients are the coefficients of regression models that have been estimated using standardized variables.
  • Standardized variables are variables that have been transformed by subtracting the mean from every observed value and then dividing by the standard deviation.
  • Unstandardized coefficients are the coefficients of regression models that have been estimated using original unstandardized variables.
  • Unstandardized variables are variables that are expressed in their original units.

Chapter 7 · Chapter 9