Social Statistics, Chapter 9: Regression Model Design

From Wikibooks, open books for an open world

Regression Model Design

Many people worry that modern society is alienating. Alienation means that people feel disconnected from larger society. Alienation was one of the first problems studied by the sociologists who founded the discipline in the late 19th century, and it continues to be a major concern today. A major symptom of alienation is a lack of trust in society. In small, closed communities where everyone knows everyone else, people have the opportunity to develop trusting relationships over years of mutual interaction. In modern societies, people end up being strangers to the people around them in stores and restaurants, their neighbors, and even their extended families. People still have friends, but their friends are spread out in wide networks. The era of village societies when you were likely to marry your next-door neighbor disappeared long ago. Nonetheless, trust is extremely important for the functioning of modern society, especially in democratic countries. If people don't have trust in society, they won't help their neighbors in times of need, enter into long-term contracts like university degree programs, or participate in democratic elections. At the most basic level, trust in society is necessary for society to function at all. Without it, we're all on our own. The World Values Survey (WVS), conducted in more than 80 countries, includes six questions about trust in society. They are: How much do you trust your family? How much do you trust people in your neighborhood? How much do you trust people you know personally? How much do you trust people you meet for the first time? How much do you trust people of another religion? How much do you trust people of another nationality? Each question can be answered on four levels, ranging from 0 = "No trust at all" through 3 = "Trust completely." An index of overall trust in society can be calculated by adding together each respondent's answers to all six questions. 
This index, "Trust in society," then ranges from a possible low score of 0 (the respondent answers "No trust at all" on all six questions) to 18 (the respondent answers "Trust completely" on all six questions). Of course, most people fall somewhere in the middle. Using data from the United Kingdom edition of the 2006 World Values Survey (WVS), the mean level of trust in society was 12.5 with a standard deviation of 2.4. Most people in the United Kingdom have a high level of trust in society. The full distribution of trust in society in the United Kingdom is plotted in Figure 9-1.
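The construction of the index can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above: the item names and the example respondent are invented for the sketch and are not the actual WVS variable names or data.

```python
# Hypothetical labels for the six trust questions described in the text.
TRUST_ITEMS = ("family", "neighborhood", "known_personally",
               "first_meeting", "other_religion", "other_nationality")

def trust_index(answers):
    """Sum six answers, each coded 0 ("No trust at all") to 3 ("Trust completely")."""
    total = 0
    for item in TRUST_ITEMS:
        value = answers[item]
        if value not in (0, 1, 2, 3):
            raise ValueError(f"{item} must be coded 0-3, got {value!r}")
        total += value
    return total

# An invented respondent, for illustration only.
respondent = {"family": 3, "neighborhood": 2, "known_personally": 3,
              "first_meeting": 1, "other_religion": 2, "other_nationality": 2}
print(trust_index(respondent))  # 13
```

A respondent who answers 0 on every item scores 0 and one who answers 3 on every item scores 18, which matches the range of the index.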

Figure 9-1. Trust in society in the United Kingdom (0-18 scale), 2006

While most people in the United Kingdom have high levels of trust in society, there are still many people who don't. Regression models could be used to help us understand why. First off, we might expect that many of the differences between people in their levels of trust in society would be determined by basic demographic factors: who they are, where they live, and how they were brought up. These should certainly be included in any model as control variables. Since trust in society includes trust in your family and in people of other religions, we might also control for family and religious factors. People's levels of trust in society might also be determined in part by differences in the ways people of different social statuses experience society. A regression of trust in society on social status should show a positive, statistically significant effect of social status. People of high social status should show significantly more trust in society because society has, in general, been good to them, while people of low social status should show less trust in society. Using WVS data, ten variables have been selected to use in studying differences in trust in society. Seven of them are background variables and three are alternative operationalizations of social status. 
The ten variables are:

Gender -- the respondent's gender, coded as Female=0 and Male=1
Age -- the respondent's age in years
Size of city or town -- population of the respondent's city of residence, scaled from 1 = less than 2000 people to 8 = greater than 800,000 people
Marital status -- whether or not the respondent is married (Married=1)
Parental status -- whether or not the respondent is a parent (Yes=1)
Religiosity -- coded on a ten-point scale from 1 = "religion is not at all important in my life" to 10 = "religion is very important in my life"
Race -- white (1) versus non-white (0)
Education -- the respondent's years of education
Income -- the respondent's income decile (lowest tenth, second tenth, third tenth, etc.)
Supervisory position -- the respondent supervises other people at work (Yes=1)

The results of a series of regression models using these ten independent variables to predict trust in society are reported in Figure 9-2. The first column in Figure 9-2 reports the correlation of each variable with trust in society. The remaining columns report the regression results. All of the coefficients are standardized so that the effects of the three different social status indicators can be compared. The original, unstandardized variables for education, income, and supervisory position are all measured on different scales, so their unstandardized coefficients would not have been directly comparable. The significance level for each coefficient is marked with an appropriate symbol, and the R2 for each model is reported at the bottom of each column. The R2 scores indicate that the models explained anywhere from 5.5% to 9.3% of the total variability in trust in society in the United Kingdom.
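To see why standardization makes coefficients comparable, the following Python sketch fits the same regression twice: once on the raw variables and once after converting every variable to z-scores. The data are invented for eight hypothetical respondents and are not WVS values, and the small ols helper is written here only for illustration (it solves the normal equations directly rather than using a statistics package).

```python
from statistics import mean, stdev

def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))]
         for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[pivot] = A[pivot], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]

def zscores(values):
    """Rescale a variable to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Invented data for eight hypothetical respondents (not WVS values).
education = [10, 12, 12, 14, 16, 16, 18, 20]   # years of education
income = [2, 3, 5, 4, 6, 8, 7, 9]              # income decile
trust = [10, 11, 12, 12, 13, 13, 15, 14]       # 0-18 trust index

# Unstandardized coefficients: the units differ (per year vs. per decile).
raw = ols([education, income], trust)
# Standardized coefficients: refit after z-scoring every variable.
std = ols([zscores(education), zscores(income)], zscores(trust))

print("raw:", [round(b, 3) for b in raw[1:]])
print("standardized:", [round(b, 3) for b in std[1:]])
```

Each standardized coefficient equals the raw coefficient multiplied by the standard deviation of its independent variable and divided by the standard deviation of the dependent variable, so it expresses the effect in standard deviation units rather than in years or deciles.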

Figure 9-2. Standardized regression models for trust in society in the United Kingdom, 2007 (N=567)

Model 1 includes all the background factors together in a single regression model. Interestingly, marriage and parenting seem to complement each other. Neither marital status nor parental status is significantly correlated with trust in society, but in the regression model both variables are significant when controlling for the other. Model 2 also includes all of the background factors, but adds education, which is highly significantly related to trust in society. In Model 3, income is found to be highly significantly related to trust in society, but in Model 4 the effect of holding a supervisory position at work is non-significant. In the final model, Model 5, the only social status variable that has a highly significant coefficient is education. The fact that the coefficients of all three social status variables are smaller in Model 5 than in the other models indicates that these three variables compete with each other in explaining trust in society. This would be expected, since they are all different ways of operationalizing the same concept. People who are highly educated tend to have high incomes and supervise other people at work, while people who are poorly educated tend to have low incomes and not supervise other people at work. Since the (standardized) coefficient for education in Model 5 is much larger than the (standardized) coefficients for income and supervisory position, we can conclude that education is the most important of the three social status determinants of trust in society. Which of the five models is the best model for explaining how social status affects trust in society? That depends on just what it is that the researcher wants to know about the effects of social status. All of the models add information that might be useful. A fuller picture of the determinants of trust in society can be developed using all five models than from any one of the models by itself.

This chapter examines in greater detail how independent variables are selected for inclusion in multiple regression models. First, background control variables that are not particularly of theoretical interest in an analysis are often lumped together in an initial base model (Section 9.1). The selection of variables to include in a base model depends on the kinds of cases being used: individual people, whole countries, or something in between. Second, the proper selection and layout of variables in regression analyses depend on the purposes for which the results of the models will be used (Section 9.2). The main distinction is between whether the models will be used for prediction or for explanation. Third, the concepts of competing and complementary controls can help make sense of some of the many reasons for including control variables in models (Section 9.3). Six reasons are highlighted, though other reasons are possible. An optional section (Section 9.4) focuses on the problems that can arise when a single model contains two or more variables that operationalize the same concept. Finally, this chapter ends with an applied case study of the gender gap in wages in the United States (Section 9.5). This case study illustrates how regression models can be designed to help shed light on an important topic in social policy. All of this chapter's key concepts are used in this case study. By the end of this chapter, you should have a more sophisticated understanding of how independent variables are selected for and used in regression models.

9.1. The base model

Compared to the models used so far in this book, models like those reported in Figures 8-16 and 9-2 have a large number of independent variables. Most regression models used by social scientists include many independent variables, from 6 or 8 up to sometimes 20 or more. When models include so many variables, it is necessary to have some way to organize them. A good place to start is with a base model. Base models are initial models that include all of the background independent variables in an analysis that are not of particular theoretical interest for a regression analysis. For example, in studying the relationship between social status and trust in society, variables like gender, age, town size, and all the other variables included in Model 1 of Figure 9-2 are not of particular theoretical interest. They are only included in order to control for the backgrounds of the people in the study. Model 1 would be considered a base model for Figure 9-2. The kinds of variables that are typically used in base models for different kinds of data are summarized in Figure 9-3. The variables to be included in a base model often depend simply on what data are available. For databases where the cases are individuals, many different variables are usually available. As you move up the chain to larger and larger units, less and less data are available, and so fewer and fewer base model variables tend to be included. For models comparing countries, the one variable that is almost always included is national income per person. The inclusion of national income in regression analysis helps adjust for the fact that rich countries like the United States and Japan are different in almost every way from poor countries like Cambodia and Haiti. If a researcher using cross-national data didn't control for national income, people who disagree with the researcher's views would almost certainly raise this as a major criticism of the researcher's regression models.

Figure 9-3. Typical base model variables for regression analyses using different kinds of cases

The main purpose of base models is usually to make cases equivalent for comparison. In the data underlying Figure 9-2, the respondents range in age from 16 to 89 years old. Some are British born and bred for generations, while others are recent migrants from Jamaica or Pakistan. Of course, some are men and some are women. They are an incredibly diverse group of people who have had very different experiences of society. Controlling for these background factors allows us to compare like with like. Because age is included as a control variable in the analyses, we can make statements like these about Model 2:

Holding age constant, the relationship between education and trust in society is significant.
For any given age, education has a significant effect on trust in society.
The age-adjusted relationship between education and trust in society is significant.
The relationship between education and trust in society is significant net of age.

These are all different ways of expressing verbally the fact that we have controlled for age. In mathematical terms, part of the difference in trust levels between people of different ages has been attributed to age, while another part has been attributed to education (and other variables). An important function of the base model is to control for basic background variables that are likely to be confounded with the explanatory variables of interest in an analysis. Confounding variables are variables that might affect both the dependent variable and an independent variable of interest. So, for example, age affects trust in society (the correlation of r = 0.162 is the strongest for any of the 10 variables in Figure 9-2), but it also affects social status. Your own education level can only go up as you get older, but at any one time for society as a whole older people are less educated, because people used to spend fewer years in school than they do today. 
It turns out that among the 567 United Kingdom WVS respondents used in Figure 9-2, the correlation between age and education is r = -0.218. Since older people have lower education and higher trust in society, age is a confounding variable in the analysis of the relationship between education and trust. If regression analyses usually start out with a base model, they usually end with a saturated model. Saturated models are final models that include all of the variables used in a series of models in an analysis. Model 5 in Figure 9-2 is an example of a saturated model. Saturated models can sometimes be difficult to interpret because of the large numbers of variables used, but they are almost always included for completeness.
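The confounding role of age can be illustrated with a small Python sketch. The data below are invented (they are not the WVS respondents), and trust is constructed without any random error so that the true effects are known in advance: +0.3 points per year of education and +0.02 points per year of age. As in the text, older people in the invented data tend to have less education. The ols helper is an illustrative implementation written for this sketch.

```python
def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))]
         for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[pivot] = A[pivot], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]

# Invented data: older respondents have fewer years of education.
age = [20, 30, 40, 50, 60, 70, 25, 35, 45, 55, 65, 75]
education = [16, 14, 15, 12, 11, 10, 13, 16, 12, 13, 10, 11]
# Trust built from known effects (no noise): +0.3 per year of education,
# +0.02 per year of age.
trust = [8 + 0.3 * e + 0.02 * a for e, a in zip(education, age)]

without_age = ols([education], trust)      # age omitted
with_age = ols([education, age], trust)    # age controlled

print("education effect, age omitted:   ", round(without_age[1], 3))
print("education effect, age controlled:", round(with_age[1], 3))
```

With age omitted, the education coefficient comes out below the true value of 0.3, because the less-educated respondents are also the older, more trusting ones. Once age is included, the model recovers the education effect that was built into the data.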

9.2. Explanatory versus predictive models

In between the base model and the saturated model, there are no real rules about what variables should be included in regression models or in what order. A common approach is to do what's been done in Figure 9-2: start with the base model, then add the independent variables of interest one at a time in separate models, then report a saturated model in which all of the independent variables are used at the same time. When models are designed to evaluate the relative strengths of different explanations of the dependent variable, the independent variables have to be entered one at a time in separate models. This makes it possible to compare how well each of them explains the dependent variable. Explanatory models are regression models that are primarily intended to be used for evaluating different theories for explaining the differences between cases in their values of the dependent variable. On the other hand, sometimes the objective of a regression analysis is simply to predict the values of a dependent variable, without any interest in the theoretical implications of the models. Predictive models are regression models that are primarily intended to be used for making predictions about dependent variables as outcomes. For example, in Figure 3-9 a very simple predictive model was used to predict levels of soft drink consumption in the US states of Alaska and Hawaii. In predictive models, it is less important to understand how the coefficients of variables change between models or to control for potentially confounding variables. All that really matters is getting a high R2 score, since R2 indicates the proportion of the total variability in the dependent variable that is accounted for by the model. In general, models with higher R2 scores make more accurate predictions of the dependent variable. Some of the key differences between explanatory models and predictive models are laid out in Figure 9-4. 
A major difference is the way that independent variables are selected for inclusion in each model type. The main objective of an explanatory model is to make inferences about the effects of different independent variables on the dependent variable. Independent variables are carefully selected for inclusion based on specific theoretical reasons, and unimportant or irrelevant variables are never included. Keeping the number of independent variables to a minimum also makes it easier to understand the role each one plays in explaining the dependent variable. In other words, explanatory models place a premium on parsimony. Parsimony is the virtue of using simple models that are easy to understand and interpret. A good explanatory model is one that sheds light on relationships that are of theoretical interest. In contrast, predictive models often take much more of a wild-west, anything-goes approach. So long as the independent variables are correlated with the dependent variable, they help in making predictions. A bizarre example of this is the use of sewage treatment flows in so-called "toilet flush models" of hotel occupancy rates. In beach resort areas, city managers want to know how many visitors they have over a major holiday weekend, but there is no single database that includes a list of all the people who stay in a city's hotels, in private rentals, or visiting friends and relatives. Instead, city managers use the amount of sewage flowing through their sewage treatment plants over the weekend to estimate the number of people who must have been in the city. There's no theoretical sense in which sewage causes people to visit a city, but sewage is a very good predictor of the number of people who actually have visited.
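A predictive model in the spirit of the "toilet flush model" can be sketched as follows. The sewage and visitor figures are invented for illustration and the function names are hypothetical. The sketch fits a one-variable regression line, computes R2 as one minus the ratio of unexplained to total variability, and then uses the line to predict visitors for a new sewage reading.

```python
from statistics import mean

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

def r_squared(y, predicted):
    """Proportion of the variability in y accounted for by the predictions."""
    my = mean(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

# Invented holiday-weekend figures: sewage flow (megaliters) and visitors ('000s).
sewage = [4.0, 5.5, 6.1, 7.4, 8.0, 9.2]
visitors = [20, 31, 33, 41, 47, 52]

a, b = fit_line(sewage, visitors)
fitted = [a + b * s for s in sewage]
r2 = r_squared(visitors, fitted)

print("R2:", round(r2, 3))
print("predicted visitors ('000s) at a flow of 7.0:", round(a + b * 7.0, 1))
```

Nothing in the sketch requires sewage to cause visits. The model is judged only by how much of the variability in visitor numbers it accounts for.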

Figure 9-4. Models meant for prediction versus models meant for explanation

9.3. Reasons for controlling in explanatory models

A major challenge in designing regression models is deciding just what to control for. In predictive models, the decision is easy: if a variable is available for use and it helps predict the dependent variable, use it. In explanatory models, the decision is much harder. There are at least six reasons why control variables might be used in an explanatory model, though others are possible as well. They are:

A. To eliminate alternative explanations
B. To compare the power of different explanations
C. To hold constant a competing explanation
D. To make cases equivalent for comparison
E. To reduce model error
F. To bring out effects that were hidden by error

The first three reasons (A-C) mainly apply in cases where the control variable tends to compete with other independent variables in explaining the dependent variable. In such cases, the use of the control variable tends to reduce the size and statistical significance of the effects of the other independent variables. The last three reasons (D-F) mainly apply in cases where the control variable tends to complement the other independent variables. In such cases, the use of the control variable can actually increase the size and significance of the effects of the other independent variables. The six reasons and explanations of the kinds of situations in which they are used are summarized in Figure 9-5.

Figure 9-5. Six common reasons for the use of competing and complementary controls

All six reasons for using control variables can be illustrated using a series of regression models that are designed to shed light on the reasons why some countries are more successful than others at immunizing their children against common infections. Though there are some controversies surrounding its use, the combined diphtheria-pertussis-tetanus (DPT) vaccine is widely used around the world to immunize infants between 12 and 23 months old against three potentially deadly childhood diseases. The World Health Organization and most national health authorities have official DPT immunization programs. Nonetheless, DPT immunization rates vary from under 40% in some of the poorest countries of Africa to over 98% in many of the middle-income countries of the Middle East and eastern Europe. The DPT immunization rate is not a major policy issue in rich countries, both because immunization rates are usually over 90% and because the three diseases -- diphtheria, pertussis, and tetanus -- are not generally life threatening in countries with good medical systems. On the other hand, in poor countries DPT immunization can literally be a matter of life and death for young children. From a policy standpoint, we would like to understand why DPT immunization programs are more successful in some countries than in others, especially in poor countries. Several explanations are possible. First, in many rich countries DPT immunization rates fall well below their potential because of parental concerns about the safety of vaccines, combined with the fact that these diseases are now so rare that most people no longer fear them. Parental fear of vaccines is difficult to measure, but (with a few exceptions) it doesn't seem to be a major factor in most poor countries. It would be useful to study the effects of parental fear across countries, but the data simply aren't available. 
Other explanations include countries' levels of development, how easy it is to reach infants who need to be immunized, the amount that countries spend on health, the number of trained medical personnel in a country who could give immunizations, and the number of children to be immunized. Specific variables that might be used to operationalize each of these explanations are:

Level of development
  National income -- national income per person ('000s of US Dollars)
  Improved water -- percentage of the population with "improved" water supply (e.g., a well)
  Improved sanitation -- percentage of the population with "improved" sanitation (e.g., an outhouse)
Ease of reach
  Urbanization -- urban population (percent of the total population)
Health spending
  Health expenditure -- national health expenditure as a percentage of national income
Trained personnel
  Doctors -- number of physicians per 1,000 population
Number of children
  Fertility -- the mean lifetime number of children per woman

Other variables could be included, but data for DPT immunization rates plus all seven of these explanatory variables are available in the World Development Indicators for 100 countries representing over 85% of the world's poor countries by population. From a social policy standpoint, we are particularly interested in knowing what can be done to increase the immunization rate. We can't easily make a country richer or more developed, and we can't do much to make children easier to reach or to reduce the number of children. On the other hand, we can give countries foreign aid to help them increase their spending on health. We can also seek volunteer doctors to help in administering immunizations. An important policy question is thus: which would be more useful, giving money or finding volunteers? The series of regression models for DPT immunization presented in Figure 9-6 helps answer this question. The models also illustrate the six reasons for using control variables. 
The letter for each reason has been attached to its corresponding illustration in the regression table.

Figure 9-6. Standardized regression models for DPT immunization rates in poor countries, 2000s (N=100)

Moving from left to right across the models, the inclusion of national income per person as an independent variable (Model 1) is an example of a control variable that makes cases equivalent for comparison (D). The 100 countries included in the analysis differ enormously in their levels of wealth. Controlling for national income helps adjust for those disparities so that we can compare like with like. The inclusion of controls for improved water and improved sanitation (Model 2) is an example of the use of control variables to reduce model error (E). Notice how the R2 score jumps from 15.5% in Model 1 to 46.0% in Model 2. The difference (30.5 percentage points) means that the two variables added in Model 2 together explain almost one-third of the total variability in immunization rates. Water and sanitation aren't really direct causes of immunization rates -- you don't need a toilet to conduct an immunization -- but they are general attributes of countries that are more developed. The inclusion of urbanization (Model 3) controls for a potentially competing explanation (C). Despite being significantly correlated with DPT immunization, the coefficient for urbanization is not significant in the regression model after controlling for national income, water, and sanitation. That's not a problem. Urbanization is not being included because of its significance. It's being included because it could potentially compete with our two variables of interest, health expenditure and doctors. The coefficients of health expenditures and the number of doctors are compared in Model 4 and Model 5 (B). Even though the number of doctors has a larger effect, the effect of health expenditures is statistically significant, while the effect of the number of doctors is not. This is a contradictory result, and the reasons for it are not clear. 
We could try to eliminate one or the other theory by including both health expenditures and doctors in a single model to see if one or the other becomes clearly unimportant when controlling for the other (A). This is done in Model 6. Unfortunately, in Model 6 both coefficients are almost identical, and neither is strongly significant. The odd and ambiguous behavior of the coefficients for health expenditures and doctors may be due to the fact that some other factor is obscuring the true effects of each. A control variable that might bring out these true effects is the fertility rate (F). Countries with high fertility rates have large numbers of children compared to their numbers of adults. This places major burdens on their health systems, since children tend to require much more healthcare than adults. Of course, it places a particular burden on immunization programs, since it is children who receive the DPT vaccine. The same amount of health expenditure or the same number of doctors per person would have much less impact in a country with high fertility than in a country with low fertility. Controlling for fertility (Model 7) increases the coefficient for health expenditure and makes it clearly statistically significant. On the other hand, it dramatically reduces the coefficient for doctors. From Model 7, it seems clear that -- after controlling for other complementary and competing factors -- having higher health expenditures is far more important for promoting immunization than having more doctors. Based on the results reported in Figure 9-6, the best policy would be for rich countries to increase their aid to poor countries rather than to recruit volunteer doctors. If expenditures are increased while holding the number of doctors (and other factors) constant, we would expect immunization rates to rise. If the number of doctors is increased while holding expenditures (and other factors) constant, we would expect no significant change in immunization rates.
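One way a complementary control can help (reasons E and F) is by soaking up variability in the dependent variable that would otherwise count as error. The Python sketch below illustrates only that mechanism, under strong simplifying assumptions: the data are invented and deliberately balanced so that the focal variable x and the control z are centered and completely uncorrelated. Under that assumption each slope is the same with or without the other variable, and the standard error of the slope for x reduces to the square root of the error variance divided by the sum of squares of x. In Figure 9-6 the coefficient for health expenditure itself also changed when fertility was added, which this sketch does not attempt to reproduce.

```python
from math import sqrt

# Invented, deliberately balanced data: x and z are centered and uncorrelated.
x = [-1, -1, -1, -1, 1, 1, 1, 1]            # focal independent variable
z = [-1, -1, 1, 1, -1, -1, 1, 1]            # complementary control
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
y = [10 + 1 * xi + 3 * zi + ei for xi, zi, ei in zip(x, z, noise)]

n = len(y)
y_mean = sum(y) / n
sxx = sum(xi ** 2 for xi in x)              # x has mean 0, so no centering needed
szz = sum(zi ** 2 for zi in z)
# Because x and z are uncorrelated, each slope is unaffected by the other variable.
b_x = sum(xi * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
b_z = sum(zi * (yi - y_mean) for zi, yi in zip(z, y)) / szz

def t_for_x(residuals, n_coefficients):
    """t statistic for b_x; this SE formula assumes x is uncorrelated with other predictors."""
    error_variance = sum(r ** 2 for r in residuals) / (n - n_coefficients)
    return b_x / sqrt(error_variance / sxx)

res_without = [yi - (y_mean + b_x * xi) for xi, yi in zip(x, y)]
res_with = [yi - (y_mean + b_x * xi + b_z * zi) for xi, zi, yi in zip(x, z, y)]

print("t for x without the control:", round(t_for_x(res_without, 2), 2))  # 0.81
print("t for x with the control:   ", round(t_for_x(res_with, 3), 2))     # 4.47
```

The coefficient for x is 1 in both models, but without the control its effect is buried in unexplained variability. Adding the control shrinks the error and the same coefficient becomes clearly distinguishable from zero.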

9.4. Partialling and the partialling fallacy (optional/advanced)

In Model 5 of Figure 9-2, three different operationalizations of social status (education, income, and supervisory status) are used in the same model to explain trust in society. In this model, it turned out that education was significantly related to trust even after controlling for income and supervisory status, while the coefficient for income was only marginally significant and the coefficient for supervisory status was not significant at all. Supervisory position was never very closely related to trust, but in Model 3 income was very highly significantly related to trust in society. In fact, the coefficient for income in Model 3 had a probability of less than .01, indicating that there was less than a 1 in 100 chance that such a strong relationship could have arisen purely at random. Why was the coefficient for income so highly significant in Model 3 but much smaller and only marginally significant in Model 5? The answer, of course, is that education and income are competing controls. Like the coefficient for income, the coefficient for education declined in Model 5, just not as much. Could it have declined more? Since all three variables measure social status, we might have expected none of them to have significant coefficients. After all, by including three operationalizations of social status in the same model we are effectively measuring the effect of social status while controlling for social status and then controlling again for social status. We might reasonably have expected the three variables to compete with each other more fully in explaining trust in society. We might have expected that, after controlling for social status in one way, other measures of social status would have had no additional impact on trust in society. This didn't happen in Figure 9-2, but it does happen all the time in regression modeling. 
When two or more operationalizations of the same concept are included in a regression model and they compete to the point where their coefficients end up being non-significant, they are said to "partial" each other. Partialling is a specific form of competition between variables in which the two (or more) variables are alternative operationalizations of the same concept. An example of partialling is depicted in Figure 9-7. Figure 9-7 reports the results of a series of regression models of county murder rates on two operationalizations of county income for 37 large US counties (populations between 500,000 and 1,000,000 people). County murder rates (per 100,000 population) come from the FBI Uniform Crime Reports database. County income is operationalized in two ways. The county poverty rate is the percent of the population in each county that lives on an income of less than the federal poverty line. County median income is the income at the middle of each county's income distribution, with half of incomes above it and half below. County poverty rates and median incomes come from the US Census Bureau. County poverty rates and county median incomes are correlated r = -0.780. As incomes go up, poverty rates go down.

Figure 9-7. Regression of murder rates on poverty and income for US counties of population between 500,000 and 1,000,000 population, 2008 (N=37)

As would be expected, counties that have higher poverty rates also have higher murder rates (Model 1). Every 1 percentage point increase in the poverty rate is associated with a 0.082 person increase in the number of people murdered per 100,000 population. That's not a lot, but it is statistically significant (probability = 0.021, which is less than 5%). Also in line with expectations, counties that have higher median incomes have lower murder rates (Model 2). Every $1000 increase in median income is associated with a 0.031 person decline in the number of people murdered per 100,000 population. Again, the relationship is small but (just) statistically significant (probability = 0.050, or 5%). In Model 3, however, neither poverty nor income is significantly related to the murder rate. Both variables have non-significant coefficients. A researcher who only looked at Model 3 without running models like Model 1 and Model 2 that examined the effect of each variable individually might conclude that neither poverty nor income was significantly related to murder rates. This error is called the "partialling fallacy." The partialling fallacy is a false conclusion that independent variables are not related to the dependent variable when, in fact, they are. The partialling fallacy label applies only in those situations where the variables partialling each other are meant to operationalize the same concept. There are at least three ways around the partialling fallacy. The simplest is to pick just one operationalization of the concept and ignore any others. A better approach is to combine the multiple operationalizations of the concept into a single variable. At the most sophisticated level, multiple operationalizations of a concept can be used together in a model and their joint power to explain the dependent variable studied through their collective impact on the model's R2 score. Notice how the R2 score in Model 3 is slightly higher than that from Model 1 (0.146 versus 0.144). 
This indicates that poverty and income together explain slightly more of the cross-city variation in murder rates than does poverty alone. The joint analysis of R2 scores has the advantage that it allows the researcher to use all of the available data in all its complexity. On the other hand, complexity is also its main drawback. Sometimes joint analysis adds value, but most times it makes more sense just to keep things simple. Model 1 explains nearly as much of the variability in murder rates as Model 3, without the distraction of managing multiple variables. It would be a reasonable compromise to study city murder rates based on city poverty levels, without worrying about median income.
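The comparison of R² scores described above can be illustrated with a small simulation. The sketch below does not use the county data behind Figure 9-7. It generates synthetic data in which two hypothetical variables, `poverty` and `income`, both reflect one underlying concept (here called `deprivation`, an assumed name), and then compares the R² of each single-variable model with the R² of the joint model. The two-predictor R² is computed from the three pairwise correlations, a standard identity for a regression with exactly two independent variables.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r2_one(y, x):
    """R-squared of a regression of y on a single predictor x."""
    return pearson(x, y) ** 2

def r2_two(y, x1, x2):
    """R-squared of a regression of y on two predictors x1 and x2."""
    r1, r2, r12 = pearson(x1, y), pearson(x2, y), pearson(x1, x2)
    return (r1 * r1 + r2 * r2 - 2 * r1 * r2 * r12) / (1 - r12 * r12)

# Synthetic data only: every number below is an illustrative assumption.
rng = random.Random(0)
n = 37  # same number of cases as Figure 9-7, chosen for flavor
deprivation = [rng.gauss(0, 1) for _ in range(n)]
poverty = [15 + 4 * d + rng.gauss(0, 1.5) for d in deprivation]
income = [55 - 8 * d + rng.gauss(0, 3) for d in deprivation]
murder = [6 + 1.0 * d + rng.gauss(0, 2.5) for d in deprivation]

r2_poverty = r2_one(murder, poverty)
r2_income = r2_one(murder, income)
r2_joint = r2_two(murder, poverty, income)

print(f"R2, poverty only:   {r2_poverty:.3f}")
print(f"R2, income only:    {r2_income:.3f}")
print(f"R2, both together:  {r2_joint:.3f}")
```

The joint R² can never be lower than either single-variable R², so the question is always how much it adds. When two variables operationalize the same concept, as here, the gain is usually small, which is the pattern seen in the 0.146 versus 0.144 comparison.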

9.5. Case study: The gender gap in wages in the United States

In all countries that have ever been studied, women receive substantially lower wages than men. This doesn't necessarily mean that employers discriminate against women, but the balance of the evidence is that they do. Nonetheless, not all of the gender gap is due to discrimination. Two competing explanations of the gender gap are that women accept jobs that pay lower wages in order to have greater flexibility in their family lives and that women earn lower wages because they work fewer hours. There are also other potential competing explanations that will be examined in Chapter 10. Control variables can be used to help us evaluate the validity of these competing explanations. The raw gender gap was illustrated in Figure 4-6 and a primitive regression model of the gender gap was presented in Figure 7-8, but a much more detailed series of explanatory models of the relationship between gender and wages is presented in Figure 9-8. Explanatory models are used instead of predictive models because the goal of the analysis is to understand the gender gap in general, not to predict any particular woman's wages. As in previous chapters, the analyses are restricted to employed twentysomething Americans who identify themselves as being either black or white. Data from Wave 1 of the 2008 Survey of Income and Program Participation (SIPP) have been used. In Figure 9-8 individual wages are regressed on 8 independent variables (including gender) in a series of four regression models.

Figure 9-8. Models to explain the gender gap in twentysomething wages in the United States, 2008 (N=6796)

Model 1 is a base model that includes four background variables: the respondent's age, race, Hispanic status, and years of education. As expected, people have higher wages when they are older, white versus black, non-Hispanic, and more educated. Model 2 adds gender to the base model. Its coefficient of -7230 indicates that, on average, twentysomething women earn $7,230 less per year than twentysomething men, even after controlling for age, race, Hispanic status, and years of education. Model 3 adds two family variables, marriage and children. Married people earn more than single people and people with children earn less than people without children. These two variables remove some error from the model, but they have very little effect on the gender gap. The possibility that women make less than men because of family obligations can safely be discarded as a competing explanation for women's lower wages.

The final, saturated model (Model 4) adds two labor market variables: whether or not people work full-time and whether or not they are attending school (which might mean that they're not working to full potential). Controlling for these competing explanations does reduce the gender gap, but only by $803, from $7,149 to $6,346. These competing explanations both have highly significant coefficients and seem to be important determinants of wages, but they do not explain the majority of the wage differences between women and men. Though the gender gap may not be due to discrimination, we can conclude from the models presented in Figure 9-8 that it probably isn't caused by family factors or labor market factors.

Incidentally, it is impossible for any of the variables in Figure 9-8 to be confounded with gender because gender is determined randomly at the time of conception, but there may be other confounded effects. For example, it is possible that older people are more educated (since the youngest people in the sample would not have finished their educations by the time of the study) and earn higher wages, so education is likely to be confounded with age. This might be a problem if the purpose of the analysis was to understand the relationship between education and income, but age and education are used here only with the intent of making cases equivalent for comparison. Similarly, marriage and children may be confounded, but again this is not an issue from the standpoint of the gender gap.

The models presented in Figure 9-8 are reasonably parsimonious. Very few variables are included, and all of them have statistically significant effects. A fuller model of twentysomething wages might control for many more variables and still not be considered overly complex. For example, an important alternative explanation of the gender gap in wages is that it might be due to women's choices of what industry to work in. This will be investigated further in Chapter 10.
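The size of each reduction in the gender gap can be checked with simple arithmetic. The short sketch below uses only the three gender coefficients quoted in the text for Models 2, 3, and 4 (signs dropped, in dollars per year) and expresses each reduction as a share of the gap before the new controls were added. The dictionary labels and the helper function `reduction` are illustrative names, not part of the original analysis.

```python
# Size of the gender gap (dollars per year) as quoted in the text for Figure 9-8.
gender_gap = {
    "Model 2 (background controls)": 7230,
    "Model 3 (adds family variables)": 7149,
    "Model 4 (adds labor market variables)": 6346,
}

def reduction(before, after):
    """Return the drop in the gap in dollars and as a share of the earlier gap."""
    drop = before - after
    return drop, drop / before

m2, m3, m4 = gender_gap.values()

family_drop, family_share = reduction(m2, m3)   # effect of adding family variables
labor_drop, labor_share = reduction(m3, m4)     # effect of adding labor market variables
total_drop, total_share = reduction(m2, m4)     # all added controls combined

print(f"Family variables:       ${family_drop} ({family_share:.1%})")
print(f"Labor market variables: ${labor_drop} ({labor_share:.1%})")
print(f"All added controls:     ${total_drop} ({total_share:.1%})")
```

Family variables shrink the gap by $81 (about 1%), labor market variables by $803 (about 11%), and the two sets together by $884 (about 12%). Roughly 88% of the gap found in Model 2 therefore remains in the saturated model.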

Chapter 9 Key Terms

  • Base models are initial models that include all of the background independent variables in an analysis that are not of particular theoretical interest for a regression analysis.
  • Confounding variables are variables that might affect both the dependent variable and an independent variable of interest.
  • Explanatory models are regression models that are primarily intended to be used for evaluating different theories for explaining the differences between cases in their values of the dependent variable.
  • Parsimony is the virtue of using simple models that are easy to understand and interpret.
  • Predictive models are regression models that are primarily intended to be used for making predictions about dependent variables as outcomes.
  • Saturated models are final models that include all of the variables used in a series of models in an analysis.
