Social Statistics, Chapter 10: Multiple Categorical Predictors: ANOVA Models

From Wikibooks, open books for an open world

Multiple Categorical Predictors: ANOVA Models

In the twentieth century Germany invaded its neighbors in two tragically destructive wars. World War I (1914-1918) and even more so World War II (1939-1945) left Europe in ruins. In the six years of World War II in Europe, roughly 18 million soldiers and 25 million civilians lost their lives, including 6 million Jews systematically killed in the Holocaust. These deaths amount to around 7% of the total population of Europe at the time, or about 1 out of every 13 people. It is difficult for us today to comprehend the scale of the losses.

After the slaughter of World War II, European leaders were adamant that Europe should never go to war again. In 1951 six western European countries signed a treaty that set up the European Coal and Steel Community. In the sixty years since then, this limited international partnership has evolved into a European Union (EU) of 27 countries with a combined population of over 500 million people and an economy larger than that of the United States. Today's EU is dedicated to supporting peace, development, democracy, and human rights across Europe. It does this through economic cooperation and collective decision-making in which the approval of all 27 countries is required for major decisions.

Considering its 60-year track record of preventing war and promoting prosperity across Europe, we might expect European citizens to have a high level of confidence in the EU. In fact, they do not. Over the years 2005-2008 the World Values Survey (WVS) was conducted in eleven EU countries. Two additional countries (Bulgaria and Romania) participated in the WVS in 2006 and joined the EU one year later in 2007. In all thirteen countries, survey respondents were asked how much confidence they had in the European Union.
Placed on a scale from 0 to 3, the available answers were:

0 -- None at all
1 -- Not very much
2 -- Quite a lot
3 -- A great deal

The overall mean answer across the thirteen countries was 1.355, much closer to "not very much" than to "quite a lot." Excluding the two countries that had not yet joined by the time of their surveys, the mean was 1.31. In the United Kingdom, which has been an EU member since 1973, the mean level of confidence in the EU was just 1.03. In most countries, people expressed more confidence in "television" than in the EU.

Figure 10-1. Mean confidence in European Union institutions, 2005-2008 (N = 13 countries)

On the other hand, not all countries are alike. People in some countries (like Italy and Spain) did express reasonably high levels of confidence in the EU. The observed mean level of confidence in the EU in the United Kingdom was 1.03 with a standard error of 0.027, so the true mean in the UK is probably somewhere between 0.98 and 1.08 (the observed mean plus or minus two standard errors). Similarly, the true mean for Italy is probably somewhere between 1.68 and 1.76. The true mean for Italy is almost certainly higher than the true mean for the United Kingdom. In Chapter 6 we learned how to use t statistics to evaluate the statistical significance of mean differences between groups, and in fact the difference between the United Kingdom and Italy is highly significant (t = 19.653 with 1850 degrees of freedom, giving a probability of .000 that the difference could have arisen by chance).

How big are the cross-national differences in confidence in the European Union overall? One way to answer this is to ask whether or not the cross-national differences are statistically significant. We can use simple regression models to evaluate the significance of the mean difference between any two groups, but with 13 countries there are 78 distinct pairs of countries that could be compared. Obviously, we wouldn't want to compare all of them. We could instead compare each of the countries to the overall European mean. This is a more reasonable strategy, but it still requires thirteen separate comparisons that then have to be combined somehow. A completely different strategy would be to pick one country (like the United Kingdom) and compare all the other countries to it. This approach wouldn't provide significance levels for every possible comparison of countries, but it would be a start.
If the twelve other countries were compared to the United Kingdom within a single multiple linear regression model, the R2 score would give some indication of how much of the total individual variability among respondents in confidence in the EU could be accounted for by differences between countries. This approach would have the added benefit that we already know how to use other variables in multiple linear regression. Between-country differences could be included in larger regression models to explain dependent variables using both group memberships and ordinary independent variables.
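The two pieces of arithmetic in the passage above (the "plus or minus two standard errors" interval and the count of distinct country pairs) can be checked with a short sketch. The function name `two_se_interval` is made up for illustration; the numbers are the United Kingdom figures quoted in the text.

```python
from math import comb


def two_se_interval(mean, se):
    """Rough interval for the true mean: observed mean +/- two standard errors."""
    return mean - 2 * se, mean + 2 * se


# United Kingdom: observed mean 1.03, standard error 0.027 (from the text)
low, high = two_se_interval(1.03, 0.027)
print(round(low, 2), round(high, 2))

# Number of distinct pairs that can be formed from 13 countries
n_pairs = comb(13, 2)
print(n_pairs)
```

The pair count grows quickly with the number of groups, which is one practical reason for preferring a single model over a long list of pairwise comparisons.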

This chapter introduces a new type of regression model design, the ANOVA model. First, mean models can be used to study how means differ across groups of cases (Section 10.1). The main limitation of the mean model is that it can only be used to examine each group individually, when what we really want to do is to consider all group differences together at the same time. Second, regression models can be used to determine the statistical significance of the differences between groups (Section 10.2). Importantly, they also tell us the total percentage of the variability in the dependent variable that can be traced to group differences. Third, ANOVA models can be embedded into larger regression models to create models that combine aspects of each (Section 10.3). These mixed models are interpreted no differently from ordinary regression models. An optional section (Section 10.4) develops a new statistic, the F statistic, that can be used to evaluate the overall statistical significance of an ANOVA model or any other regression model. Finally, this chapter ends with an applied case study of racial differences in education levels in the United States (Section 10.5). This case study illustrates how group differences in the values of a dependent variable can be explained using mixed models. All of this chapter's key concepts are used in this case study. By the end of this chapter, you should understand how regression can be used to study between-group differences in the values of a dependent variable.

10.1. Comparing group differences using mean models

The natural way to think of group differences in the values of a variable is to think of each group as having a mean of its own. For example, in Figure 10-1 each of the thirteen countries has its own mean level of confidence in the EU. The 885 Bulgarians have a mean of 1.63, the 1,049 Cypriots have a mean of 1.41, and so on down the list. This is the way people usually look at group means, and it is very intuitive. The problem with this approach is that it isn't necessarily an accurate reflection of reality. In reality, individual Europeans report more or less confidence in the EU for a variety of reasons. One of those reasons is the country in which a person lives. Another might be the person's age. More potential reasons include the person's gender, the person's level of education, the person's income, and whether or not the person has ever received funding or other assistance directly from the EU. Confidence in the EU as reported on the WVS can even be attributed in part to whether or not it was a sunny day when the question was asked: people are more likely to rate things highly on a sunny day than on a cloudy day.

For all these reasons, a more appropriate way to think about group means is to think of them as group differences in the values of a single variable. The simplest way to do this is using a mean model (Chapter 4). Applying this approach to the European data, there is an overall mean level of confidence in the EU for all Europeans. The levels of confidence that people actually report at any particular time differ from this true mean for a wide variety of reasons. One of those reasons is their country of residence. The mean level of confidence in the EU for each country depends on the specific history and circumstances of that country.
Mean models can be used to evaluate whether or not the true means for each country are significantly different from the overall mean level of confidence in the EU for all Europeans. In Chapter 6 we learned how to use the t statistic to evaluate the probability that a true mean might be zero. We can just as easily evaluate the probability that a true mean might be any number. In this situation, we want to know the probability that the true mean level of confidence in the EU in each country might be equal to the overall European mean. The observed mean across all European WVS respondents (14,154 individuals divided into 13 countries) was 1.355. Using the means for each country and their standard errors, t statistics can be calculated to evaluate how far each one is from the European mean. These t statistics and their associated probability levels are reported in Figure 10-2.
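As a minimal sketch of the calculation described above, the t statistic for one country is the distance between its observed mean and the overall European mean, measured in standard errors. This version treats the overall mean of 1.355 as a fixed reference value, which is a simplifying assumption; the helper name `t_against` is hypothetical, and the result is not claimed to match the entry printed in Figure 10-2.

```python
def t_against(mean, se, reference):
    """t statistic for the difference between an observed mean and a reference value."""
    return (mean - reference) / se


# United Kingdom: mean 1.03, standard error 0.027; overall European mean 1.355
t_uk = t_against(1.03, 0.027, 1.355)
print(round(t_uk, 2))
```

A t statistic this far from zero (roughly twelve standard errors below the European mean) would be highly significant under any conventional threshold.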

Figure 10-2. A mean model approach to evaluating group differences in confidence in European Union institutions, 2005-2008 (N = 14,154 individuals divided into 13 countries)

Based on the results reported in Figure 10-2, mean levels of confidence in Poland (mean = 1.39) and Cyprus (mean = 1.41) are not significantly different from the overall European mean, while the means for all other countries are highly significantly different from the overall European mean. Since some countries are significantly above the European mean and other countries are significantly below the European mean, we can conclude that overall there are important cross-national differences between European countries in confidence in the EU.

In this example, nearly all of the t statistics are so highly significant that it is easy to see that the differences between countries matter. In other situations, things might not be so clear. For example, we might want to know whether or not there are regional differences in countries' success in immunizing young children against disease. An international database of rates of diphtheria-pertussis-tetanus (DPT) immunization was described in Chapter 9. The database includes data for 100 poor countries organized into six regions. The distribution of the 100 countries in the database by official World Bank region is:

EAP -- East Asia & Pacific (14 countries)
ECA -- Eastern Europe & Central Asia (19 countries)
LAC -- Latin America & Caribbean (9 countries)
MNA -- Middle East & North Africa (8 countries)
SAS -- South Asia (8 countries)
SSA -- Sub-Saharan Africa (42 countries)

The DPT vaccine is given to infants aged 12-23 months. The mean immunization rate for infants of these ages across all 100 countries is 81.2%. In five of the six World Bank regions the immunization rate is higher than 81.2%, but in one region (Sub-Saharan Africa) it is lower. The regional mean DPT immunization rates are plotted in Figure 10-3. The 100-country mean has also been placed on the chart as a reference line.

Figure 10-3. Bar chart of regional differences in DPT immunization rates, 2005 (N = 100 poor countries)

The mean model approach that was applied to the EU data can also be used to study cross-regional differences in DPT immunization rates. In the DPT example, the cases are countries and the countries are grouped into regions. In Figure 10-4, regional deviations from the overall 100-country mean of 81.23% are evaluated using t statistics. Of the six World Bank regions, two (ECA and LAC) have DPT immunization rates that are significantly higher than the 100-country mean, one (SSA) has a DPT immunization rate that is significantly lower than the 100-country mean, and three (EAP, MNA, and SAS) have DPT immunization rates that are not significantly different from the 100-country mean.

Figure 10-4. Mean model of regional differences in DPT immunization rates, 2005 (N = 100 poor countries)

With two higher, one lower, and three the same, can we conclude that there are meaningful regional differences in DPT immunization rates? It's not so clear as in the EU example. The answer is probably yes (three of the six regions do show significant differences), but we don't have any firm guidelines to back this up.

10.2. ANOVA as a regression model

Mean models can be useful for answering simple questions about the levels of variables, but most social scientists rarely use them. Social scientists usually want to study multiple aspects of how independent variables affect a dependent variable, and this can only be done in the context of regression modeling. In a regression model (or series of regression models), the total observed variability in a dependent variable can be divided up many different ways. The only real limitation of regression modeling is that all of the variables (both dependent and independent) have to be represented by numbers. You can't directly use a variable like the respondent's country of residence in a regression model.

Most of the time when we talk about variables, statistics, and regression models we think of numbers. We have some data about a person (like age or years of education) that usually start at zero and run up from there. Numerical variables are variables that take numerical values that represent meaningful orderings of the cases from lower numbers to higher numbers. Numerical variables don't have to start at zero. For example, they could also be negative, like the gender gaps in science in different countries (Figure 7-1). It is possible for a variable to use numbers as values but still not be a numerical variable, but this is rare. For example, region codes for DVDs divide the world into six regions (1-6), but the numbers don't have any real meaning as numbers. There's no sense in which Region 2 (Europe) is "more regional" than Region 1 (North America).

On the other hand, variables like WVS respondents' countries of residence (Figure 10-1) and the World Bank region in which a country is located (Figure 10-3) aren't attached to numbers at all. Instead, the values of these variables are names that describe groups of cases. Variables like World Bank region that describe groups of cases with names instead of numbers are called categorical variables.
Categorical variables are variables that divide cases into two or more groups. Categorical variables include both variables with multiple groups (like World Bank region and WVS country) and variables with just two groups (like "gender" coded as being either male or female). Since categorical variables are not numeric, they can't be added, subtracted, multiplied, or divided. They also can't be used directly in regression models. When we've wanted to use categorical variables like gender in regression models, we've had to code them as 0/1 variables where one gender took the value "0" and the other gender was coded as "1." Even though the variable "gender" (male/female) is categorical, the variable "female" (0=no, 1=yes) is numerical. It represents how female the respondent is: 0 (not at all) or 1 (completely). Because "female" is numerical, it can be used in regression models. For example, in Model 2 of Figure 9-8 the coefficient for "female" was -7230, indicating that (after controlling for other factors) being female changed a person's expected wage by 1 x -7230 = -$7,230 compared to people who weren't female.

Categorical variables with more than two groups are more complex. A special kind of regression model design exists to accommodate these variables. These models are called "analysis of variance" models. Analysis of variance (ANOVA) is a type of regression model that focuses on the proportion of the total variability in a dependent variable that is explained by a categorical variable. Since all regression models involve the analysis of variance, it's a little strange to use the name "analysis of variance" to refer to just this one type of regression model. Unfortunately, the name has been widely used in the social sciences for at least half a century, so it's too late to change it to something better.
Partly as a response to this awkwardness, most social scientists today use the acronym "ANOVA" when referring to regression models that use categorical independent variables instead of spelling out the full name.

Before categorical variables can be used in ANOVA models they have to be recoded into numerical variables. These new numerical variables are called ANOVA variables. ANOVA variables are the numerical variables in a regression model that together describe the effects of categorical group memberships. When a categorical variable only has two groups (like gender), it can be recoded into a single ANOVA variable (like female = 0 for men and 1 for women). This single numerical variable can then be used as an independent variable in regression models. When a categorical variable has three groups, two new variables are needed. For example, consider the variable "party affiliation" which in most US election surveys must take one of three values (Democratic, Republican, Independent). This can be recoded into the two ANOVA variables:

Democratic -- coded 1 for Democrats and 0 for all others
Republican -- coded 1 for Republicans and 0 for all others

These two numerical variables can then be used as independent variables in a regression model. Why isn't there a third variable for Independents? Because if a person has the value "0" for the variable "Democratic" and the value "0" for the variable "Republican," that person must be an Independent. No extra variable is needed. More than that: if you tried to use a third variable for Independents in a regression model, the program wouldn't allow it, because the three variables would always add up to exactly 1 for every case and so would carry no information beyond what the constant already provides. A categorical variable with two groups (gender) uses one ANOVA variable, a categorical variable with three groups (party) uses two ANOVA variables, a categorical variable with four groups uses three ANOVA variables, etc. The number of ANOVA variables is always one less than the number of groups in the original categorical variable.
So for example the categorical variable describing what World Bank region a country belongs to has six groups: EAP, ECA, LAC, MNA, SAS, and SSA. Before this categorical variable can be used in regression models, it has to be recoded into five ANOVA variables. One group is set aside and not made into a new variable. This group is known as the reference group. Reference groups are the groups that are set aside in ANOVA variables and not explicitly included as variables in ANOVA models. Setting aside SSA (Sub-Saharan Africa) as the reference group, the five ANOVA variables for World Bank region are:

East Asia & Pacific -- coded 1 for EAP countries and 0 for all others
Eastern Europe & Central Asia -- coded 1 for ECA countries and 0 for all others
Latin America & Caribbean -- coded 1 for LAC countries and 0 for all others
Middle East & North Africa -- coded 1 for MNA countries and 0 for all others
South Asia -- coded 1 for SAS countries and 0 for all others

Any country that is coded "0" on all five of these ANOVA variables must be, by process of elimination, a Sub-Saharan African country. A regression of DPT immunization rates on these five ANOVA variables is presented in Figure 10-5. The R2 of this model, 0.297, indicates that World Bank region explains 29.7% (almost 30%) of the total variability across countries in DPT immunization rates.
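The recoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_anova_variables` and the short list of example cases are hypothetical, and statistical packages normally provide their own tools for this step.

```python
def make_anova_variables(values, reference):
    """Recode a categorical variable into 0/1 ANOVA variables.

    One variable is created for every group except the reference group,
    so k groups produce k - 1 variables.
    """
    groups = sorted(set(values) - {reference})
    return [{group: int(value == group) for group in groups} for value in values]


# Hypothetical cases: the World Bank region of seven countries
regions = ["SSA", "ECA", "LAC", "SSA", "EAP", "MNA", "SAS"]
rows = make_anova_variables(regions, reference="SSA")

for region, row in zip(regions, rows):
    print(region, row)
```

Cases in the reference group (here SSA) come out as 0 on all five variables, while every other case has exactly one variable equal to 1.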

Figure 10-5. Regression of national DPT immunization rates on region using Sub-Saharan Africa as the reference group, 2005 (N = 100 poor countries)

The coefficients in Model 1 of Figure 10-5 can be read just like those of any other regression model. When all five independent variables equal "0," the expected value of the DPT immunization rate is 71.7% (the constant). What does this mean? It means that 71.7% is the conditional mean rate of DPT immunization in Sub-Saharan Africa. In a straightforward ANOVA model like this, the constant gives the conditional mean value of the dependent variable for the reference group. This is no different from a simple regression model like that reported in Figure 4-6, where the constant represented the mean income for women (since for women, the value of the variable "Male" was 0). In Figure 10-5, Sub-Saharan African countries have the value 0 on all five ANOVA variables. As a result, their expected DPT immunization rates are: 71.7 + 11.3 x 0 + 22.5 x 0 + 15.7 x 0 + 18.1 x 0 + 9.6 x 0 = 71.7%. You can confirm this by looking up Africa's mean rate of DPT immunization in Figure 10-4. The coefficients for the five ANOVA variables represent the difference in mean DPT immunization rates between Sub-Saharan Africa and each of the regions. For example, the expected DPT immunization rate for countries in the World Bank region of Latin America & Caribbean is: 71.7 + 11.3 x 0 + 22.5 x 0 + 15.7 x 1 + 18.1 x 0 + 9.6 x 0 = 87.4%. Again, you can confirm this by looking it up in Figure 10-4. Since the actual regression coefficients in an ANOVA model represent differences from the reference group, different reference groups will yield different results. In Figure 10-5, all the coefficients are positive because every World Bank region has a higher mean level of DPT immunization than Sub-Saharan Africa. By contrast, every World Bank region has a lower mean level of DPT immunization than Eastern Europe & Central Asia. In an identical ANOVA using Eastern Europe and Central Asia as the reference group, all the coefficients would be negative. This is illustrated in Figure 10-6. 
In Figure 10-6, the R2 is the same as in Figure 10-5, but all the coefficients (including the constant) have changed.
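The expected values worked out from Figure 10-5 can be reproduced with a small sketch. The constant and coefficients below are the rounded values quoted in the text; pairing each coefficient with a region follows the order in which the text lists the ANOVA variables, which is an assumption made here. The function name `expected_rate` is hypothetical.

```python
# Rounded values quoted in the text for the model with SSA as the reference group
constant = 71.7
coefs = {"EAP": 11.3, "ECA": 22.5, "LAC": 15.7, "MNA": 18.1, "SAS": 9.6}


def expected_rate(region):
    """Constant plus each coefficient times its 0/1 ANOVA variable."""
    dummies = {name: int(name == region) for name in coefs}
    return constant + sum(coefs[name] * dummies[name] for name in coefs)


for region in ["SSA", "EAP", "ECA", "LAC", "MNA", "SAS"]:
    print(region, round(expected_rate(region), 1))
```

Because only one ANOVA variable can equal 1 for any given country, each expected value is simply the constant plus that region's coefficient (or the constant alone for the reference group).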

Figure 10-6. Regression of national DPT immunization rates on region using Eastern Europe & Central Asia as the reference group, 2005 (N = 100 poor countries)

The R2 is still 0.297, since World Bank region explains 29.7% of the cross-national variability in DPT immunization rates no matter what region is used as the reference group. The constant now represents the mean level of DPT immunization in Eastern Europe & Central Asia (the reference group). The coefficients for the regions now represent the differences between those regions' mean levels of DPT immunization and the level in Eastern Europe & Central Asia. Note that although all of the coefficients have changed, all of the expected values generated by the model remain the same. For example, the expected DPT immunization rate for countries in the World Bank region of Latin America & Caribbean is still: 94.3 - 11.3 x 0 - 6.8 x 1 - 4.4 x 0 - 12.9 x 0 - 22.5 x 0 = 87.5%. The slight difference from the earlier result (87.4% versus 87.5%) is due to rounding. No matter what group is chosen as the reference group in an ANOVA analysis, the R2 and the conditional means for each category (calculated from their expected values) remain the same.

The only real difference between Figure 10-5 and Figure 10-6 is in the statistical significance of the coefficients. In Figure 10-5, all of the groups are compared to Sub-Saharan Africa, while in Figure 10-6 all of the groups are compared to Eastern Europe & Central Asia. The reported significance levels relate to how different the mean for each group is from the mean of the reference group, and different reference groups will produce different significance levels. As a result, in ANOVA models the specific statistical significance of each of the ANOVA variable coefficients is not usually very important.

The ANOVA analyses reported in Figure 10-5 and Figure 10-6 are preferable to the six mean models reported in Figure 10-4 for several reasons. First, the ANOVA models tell us the total proportion of the cross-national variability in immunization rates that is due to differences between World Bank regions (almost 30%).
Second, they tell us this using a single model (instead of six models). Third, they integrate the analysis of categorical independent variables into a regression modeling framework. This final point is by far the most important, because it allows us to apply all the tools of regression modeling to the study of the effects of categorical independent variables.
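The effect of switching reference groups can be illustrated with a short sketch. It starts from the rounded Sub-Saharan Africa coefficients quoted in the text and re-expresses them relative to Eastern Europe & Central Asia; the function names are hypothetical. Because the inputs are already rounded, a few results differ in the last digit from the values the text quotes from Figure 10-6 (for example 94.2 here versus 94.3 there).

```python
def rereference(constant, coefs, old_ref, new_ref):
    """Re-express ANOVA coefficients relative to a different reference group."""
    shift = coefs[new_ref]
    new_constant = constant + shift          # mean of the new reference group
    new_coefs = {g: c - shift for g, c in coefs.items() if g != new_ref}
    new_coefs[old_ref] = -shift              # old reference group becomes a variable
    return new_constant, new_coefs


def expected(constant, coefs, region):
    """Expected value for a region; the reference group has no coefficient."""
    return constant + coefs.get(region, 0.0)


ssa_constant = 71.7
ssa_coefs = {"EAP": 11.3, "ECA": 22.5, "LAC": 15.7, "MNA": 18.1, "SAS": 9.6}

eca_constant, eca_coefs = rereference(ssa_constant, ssa_coefs, "SSA", "ECA")
print(round(eca_constant, 1), {g: round(c, 1) for g, c in eca_coefs.items()})

for region in ["EAP", "ECA", "LAC", "MNA", "SAS", "SSA"]:
    before = expected(ssa_constant, ssa_coefs, region)
    after = expected(eca_constant, eca_coefs, region)
    print(region, round(before, 1), round(after, 1))
```

Every coefficient changes sign or size, but the expected value for each region is identical under both codings, which is the point made in the text.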

10.3. Mixed models

ANOVA models are just regression models with a very particular setup of independent variables. Once appropriate ANOVA variables have been created to represent a categorical variable, they can be used in other regression models as well. For example, Figure 9-6 presented a series of seven regression models that used seven different variables to explain cross-national differences in DPT immunization rates. In Figure 10-7, these variables are combined with World Bank region into a single analysis that includes models that mix ANOVA variables with numerical variables. Mixed models are regression models that include both ANOVA components and ordinary independent variables. Like the coefficients of ANOVA models, the coefficients of mixed models are just ordinary regression coefficients and are interpreted the same way as any other regression coefficients.

Figure 10-7. Mixed models for DPT immunization on region, 2005 (after Figure 9-6; N = 100 poor countries)

Model 1 in Figure 10-7 is a base model that includes general development variables that are not directly related to immunization. Model 2 adds the five ANOVA variables for region. After controlling for level of development, the regional differences are much smaller than they were in Figure 10-6 (which also used Eastern Europe & Central Asia as the reference group). This indicates that most of the differences between regions are due to regional differences in level of development. In fact, the R2 of Model 1 is 0.467 while the R2 of Model 2 is 0.476, for an improvement of just 0.009. This means that after controlling for level of development (Model 1), the additional explanatory power due to regional differences (Model 2) is just 0.9%. Health and demographic variables add much more explanatory power, bringing the final proportion of the cross-national variability in immunization rates explained by the models to 54.3% in Model 4.

In the mixed model presented in Figure 10-7 the ANOVA variables add very little explanatory power and are not statistically significant (at least when Eastern Europe & Central Asia is used as the reference group). In other mixed models the ANOVA variables can have much greater impact. Figure 10-8 builds on Figure 9-8, using mixed models to improve our understanding of the gender gap in wages among American twentysomethings. In Figure 9-8, race was operationalized using a simple distinction between whites and non-whites. In Figure 10-8, race is operationalized as a four-group ANOVA variable, using whites as the reference group. Figure 10-8 also includes another ANOVA variable in Model 5: the industry in which a person works.

Figure 10-8. Mixed models to explain the gender gap in twentysomething wages in the United States, 2008 (after Figure 9-8; N = 7919 American twentysomethings)

Industry in Figure 10-8 is operationalized as a categorical variable taking four possible values:

AMM -- Agriculture, mining, and manufacturing
Trade -- Wholesale and retail trade
Services -- Education, healthcare, financial, and other services
Government -- Federal, state, and local government, plus non-profit organizations

The highest-paid group, AMM, has been used as the reference group. The coefficients reported in Model 5 indicate that (after controlling for all other variables) people working in trade and services make significantly less than people working in AMM, while people working in government make slightly (not significantly) less. Controlling for industry has very little effect on the R2 of the models (R2 improves from 0.210 to 0.214, or 0.4%), but it has a big effect on the gender gap. In Model 4, the gender gap is $5,501. This means that even after controlling for age, race, ethnicity, education, marriage, children, employment status, and educational enrollment, twentysomething American women are still found to make $5,501 less than twentysomething American men. Even after all those controls have been taken into account, controlling for the industry in which a person is employed reduces the gender gap by a further $791 a year, to $4,710. This is a pretty substantial drop, considering that industry has been only very broadly accounted for (for example, education, health, and finance have all been lumped together in "services"). Better controlling for industry and occupation would likely reduce the gender gap further. On the other hand, the gender gap is still very substantial, amounting to over $4,000 out of a typical wage of $20,000 or so a year. Even after controlling for many competing explanations, the gender gap in wages for American twentysomethings is at least 20%, and probably larger.

10.4. ANOVA and the F statistic (optional/advanced)

Despite the fact that ANOVA is (mathematically) a regression model, most textbooks present it before teaching regression and do not draw a connection between the two. Instead, ANOVA is taught only as a tool for evaluating group differences. In this approach, the key question asked in ANOVA analysis becomes: are there significant differences between groups in the values of the dependent variable? This question is answered by comparing the differences in the values of the dependent variable between groups to the remaining differences in the values of the dependent variable within each group. If the between-group differences are large relative to the within-group differences, the ANOVA model explains a significant portion of the overall variability in the dependent variable. If the between-group differences are very small, the ANOVA model is not significant.

This traditional approach is illustrated in Figure 10-9. Figure 10-9 uses the same DPT immunization data as Figure 10-3, but unlike Figure 10-3 it shows the immunization rate for every one of the 100 countries in the database of poor countries. A few sample countries have been labeled on the graph. Each country deviates from the overall mean of 81.2% immunization by a different amount and for different reasons. For example, the DPT immunization rate in Guinea is just 51.0%, far below the overall mean of 81.2%. Part of the reason DPT immunization in Guinea is so low is that it's in Africa, and part of the reason is model error that is specific to Guinea. The same division can be made for every country: part of each country's deviation from the overall mean immunization rate is due to its region, while part of its deviation is due to model error.

Figure 10-9. Graphical illustration of the traditional ANOVA approach to modeling regional differences in national DPT immunization rates, 2005 (after Figure 10-3; N = 100 poor countries)

In the traditional ANOVA model, the deviations of the regional means from the overall mean are all squared and summed up into a sum of squared deviations. The remaining deviations of the countries from their regional means are also squared and summed up. Then the two sums are compared to determine whether or not the deviations due to region make up a statistically significant proportion of the total squared deviations. This process is summarized in Figure 10-10 for the DPT data. The sum of squared deviations for the regions is 8000.93, while the sum of squared error deviations is 18944.78. Each sum of squared deviations is then divided by its degrees of freedom. Since the six regions can be fully described using five ANOVA variables, region overall (as a categorical variable) only has five degrees of freedom. As with the t statistic (Chapter 6), estimating the mean also takes up one degree of freedom. Since there are 100 cases, this means that there are 100 - 5 - 1 = 94 degrees of freedom left over for model error.
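The division of squared deviations described above can be sketched on a small made-up data set. The three groups and their values below are hypothetical and are not the DPT data; the function name `sums_of_squares` is also invented for illustration. The sketch follows the standard convention of counting each group's deviation from the overall mean once for every case in that group.

```python
def sums_of_squares(groups):
    """Split total squared deviations into between-group and within-group parts."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = sum(all_values) / len(all_values)

    between = 0.0
    within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        # each case contributes its group's squared deviation from the grand mean
        between += len(values) * (group_mean - grand_mean) ** 2
        # plus its own squared deviation from its group mean (model error)
        within += sum((x - group_mean) ** 2 for x in values)

    total = sum((x - grand_mean) ** 2 for x in all_values)
    return between, within, total


# Hypothetical immunization rates for eight cases in three groups
toy = {"A": [60.0, 70.0, 80.0], "B": [85.0, 90.0, 95.0], "C": [75.0, 80.0]}
between, within, total = sums_of_squares(toy)

n_cases = sum(len(v) for v in toy.values())
df_between = len(toy) - 1                 # groups minus one
df_within = n_cases - df_between - 1      # cases minus group variables minus the mean
print(between, within, total, df_between, df_within)
```

The between-group and within-group sums always add up to the total sum of squared deviations, which is what makes it meaningful to ask what proportion of the total is due to group membership.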

Figure 10-10. Traditional ANOVA model of regional differences in national DPT immunization rates, 2005 (after Figure 10-5; N = 100 poor countries)

The ratio of the mean squared deviation per degree of freedom attributable to group effects to that attributable to error is called the "F" statistic (named in honor of statistician Ronald Fisher). The F statistic has two different degrees of freedom, one for its numerator and one for its denominator. In Figure 10-10, the F statistic for the explanatory power of region is 7.94 with 5 and 94 degrees of freedom. Using a reference book or statistical software program to check its significance, this F statistic is associated with a probability that rounds to 0.000 (that is, less than 0.001). World Bank region significantly predicts DPT immunization rates.

The F statistic is usually taught with reference to ANOVA, but it actually applies to all regression models. Regression output from statistical software programs almost always includes the F statistic. It's not often used because it is almost always statistically significant. It's a rare regression (or ANOVA) model that doesn't explain a significant proportion of the overall variability in the dependent variable. The F statistic has some useful advanced applications in comparing the explanatory power of nested regression models (sets of models in which one model includes all of the variables used in another model, plus some additional variables), but it is not very useful for describing the results of ANOVA models. The R2 statistic is usually far more useful for diagnosing whether or not group differences are substantively meaningful.
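The arithmetic behind the F statistic can be retraced from the two sums of squared deviations reported for the DPT data. The following is a minimal sketch, not the output of any particular statistics package; the function name `f_statistic` and its parameter names are assumptions made for illustration. It divides each sum by its degrees of freedom and takes the ratio of the two results. It also prints the share of the total squared deviations attributable to region, which is the kind of quantity the R2 statistic summarizes.

```python
def f_statistic(ss_groups, ss_error, n_cases, n_groups):
    """Return (F, numerator df, denominator df) from sums of squares."""
    df_groups = n_groups - 1             # six regions need five ANOVA variables
    df_error = n_cases - df_groups - 1   # one more df is used by the mean
    ms_groups = ss_groups / df_groups    # mean squared deviation for groups
    ms_error = ss_error / df_error       # mean squared deviation for error
    return ms_groups / ms_error, df_groups, df_error


# Sums of squared deviations as reported in the text for the DPT data.
SS_REGION = 8000.93
SS_ERROR = 18944.78

f_value, df1, df2 = f_statistic(SS_REGION, SS_ERROR, n_cases=100, n_groups=6)
print(f"F = {f_value:.2f} with {df1} and {df2} degrees of freedom")
# prints: F = 7.94 with 5 and 94 degrees of freedom

# Share of all squared deviations that is attributable to region.
share = SS_REGION / (SS_REGION + SS_ERROR)
print(f"Share of squared deviations due to region: {share:.3f}")
```

Looking up the probability associated with this F value still requires a reference table or statistical software, as described above.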

10.5. Case study: Racial differences in education in the United States

Americans of different races have always faced different educational opportunities. Before the 1960s, many schools and universities were entirely closed to black students, and some schools and universities also discriminated against students of Asian and other racial backgrounds. In addition to outright discrimination by schools, people of different races also faced a range of educational barriers based on income, location, awareness of opportunities, and many other factors. Figure 10-11 reports the racial differences in education among American adults age 30 and over who participated in the 2008 Survey of Income and Program Participation (SIPP), Wave 2. The focus is on people aged 30 and over because most people have finished their education by age 30.

Figure 10-11. Racial differences in education among American adults age 30 and over, 2008 (SIPP data)

The data reported in Figure 10-11 show clear differences in educational levels between races. Asian Americans have the highest levels of education, while African Americans and people of "Other" race (mainly Native Americans) have the lowest. An ANOVA model confirms that all three of these races have mean levels of education that are significantly different from those of Whites. The categorical variable "race" is operationalized into three ANOVA variables in Model 1 of Figure 10-12. The White group is used as the reference group. The education levels of Asians, Blacks, and Others are all highly significantly different from those of Whites, with Asians having more education and Blacks and Others less. On the other hand, despite the fact that these differences are very highly statistically significant, they explain less than 1% of the total individual variability in education. Most of the individual variability in levels of education is apparently unrelated to race.

It is possible that at least some of the differences in levels of education between races in the United States are due to differences in the age and gender composition of the American population by race: age and gender may be confounding variables in Model 1. This proposition can be examined using a mixed model that controls for age (a numerical variable) and gender alongside the categorical variable race. A base model using just age and gender to predict education is presented in Model 2 and a mixed model including race as well in Model 3. The coefficients of the ANOVA variables in Model 3 indicate that the gaps between Whites and Blacks and between Whites and Others are actually larger after controlling for age and gender, not smaller. Controlling for age and gender, Blacks receive on average 0.581 fewer years of education than Whites, while Others receive 0.714 fewer years of education.

Figure 10-12. Mixed models to explain racial differences in education among American adults age 30 and over, 2008 (N = 53,560 individuals)
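The step of operationalizing the categorical variable "race" into three ANOVA variables, with Whites set aside as the reference group, can be illustrated with a short Python sketch. The function name `anova_variables` and the five sample respondents are hypothetical and are not drawn from the SIPP data. Each respondent receives a 1 on the variable for their own group and a 0 on the others, so a respondent in the reference group is 0 on all three.

```python
def anova_variables(values, reference):
    """Turn a categorical variable into 0/1 ANOVA variables,
    leaving out the reference group."""
    groups = sorted(set(values) - {reference})
    return [{group: int(value == group) for group in groups} for value in values]


# Hypothetical respondents, for illustration only.
races = ["White", "Black", "Asian", "Other", "White"]

for race, row in zip(races, anova_variables(races, reference="White")):
    print(race, row)
```

Because the reference group has no variable of its own, the coefficient for each ANOVA variable in a regression is read as that group's difference from the reference group, which is how the gaps relative to Whites are reported in Figure 10-12.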

Clearly, racial differences in education exist in the United States. That is not very surprising, given what we know about the long history of racial discrimination and disadvantage in US society. It might be more interesting to know how the racial gap in education has changed over time. The older Americans in the SIPP sample grew up in a racially segregated America that often didn't allow non-Whites to attend universities, while the younger Americans in the SIPP sample grew up in a society that was officially race-free or even promoted attendance for racial minorities. Hopefully, this means that the racial gap is declining over time. Is the racial gap in education smaller for younger Americans who came of age in the 1980s and 1990s than it is for older Americans who came of age in the 1950s and 1960s? Answering that question requires a new type of model, the interaction model, that is the focus of Chapter 11.

Chapter 10 Key Terms

  • Analysis of variance (ANOVA) is a type of regression model that focuses on the proportion of the total variability in a dependent variable that is explained by a categorical variable.
  • ANOVA variables are the numerical variables in a regression model that together describe the effects of categorical group memberships.
  • Categorical variables are variables that divide cases into two or more groups.
  • Mixed models are regression models that include both ANOVA components and ordinary independent variables.
  • Numerical variables are variables that take numerical values that represent meaningful orderings of the cases from lower numbers to higher numbers.
  • Reference groups are the groups that are set aside in ANOVA variables and not explicitly included as variables in ANOVA models.

