# Using Regression to Make Predictions

Global warming is one of the greatest threats facing the world in the 21st century. Climate scientists are now absolutely certain that global warming is occurring and that it is related to human activity. The most obvious cause of global warming is fossil fuel consumption (though there are many other causes). Fossil fuels are minerals like coal, oil, and natural gas that were buried under the Earth's surface millions of years ago. Over the long history of the Earth, enormous amounts of carbon have been removed from the atmosphere by natural processes and deposited in the ground as minerals. Then, starting for real in the 1800s but gaining momentum in the 1900s through to today, we began digging and pumping these minerals out of the Earth to burn in our homes, power plants, and automobiles. Whenever we burn these carbon minerals, we release carbon dioxide (CO2) into the atmosphere, which leads to global warming.

Global warming may seem like a topic for physical scientists to study, but really it is a social science topic. Physical scientists have told us how to stop global warming: if we just stop burning fossil fuels, the Earth will stop warming and eventually return to normal. The problem is that people don't want to stop burning fossil fuels. Changing people's attitudes and behavior is a social science problem.

Figure 3-1 is an extract of data from a World Bank database called the World Development Indicators (WDI). The cases in the WDI database are countries. The columns of the database include two metadata items (the World Bank country code and the country name). Three variables are also included:

• CO2 -- Metric tons of carbon dioxide per person emitted by the country
• GNP -- The country's gross domestic product per capita, a measure of average national income
• CARS -- The number of passenger cars per 1000 residents of the country

Countries were excluded where data were unavailable. For example, the WDI database included no passenger car data for Canada, so Canada is not included in Figure 3-1 or in the analyses to follow. Lack of data is the reason that the database includes data for only 51 out of the 200 or so countries of the world.

Figure 3-1. Carbon dioxide (CO2) emissions data for 51 countries from the World Bank, 2005

Presumably, countries that have more cars burn more gasoline. If so, we might hypothesize that the number of cars in a country should be positively related to its carbon dioxide emissions. Figure 3-2 shows a scatter plot of carbon dioxide emissions (dependent variable) versus passenger cars (independent variable) for the 51 countries represented in Figure 3-1. A linear regression model has been used to place a trend line through the data. While there is a lot of regression error around the trend line, the slope of the line is definitely positive. For every additional 100 cars in a country, the expected value of carbon dioxide emissions goes up by 1.25 tons per person. In other words, the slope of the regression line is 1.25 / 100 = .0125. This tends to support the hypothesis that numbers of cars are positively related to carbon dioxide emissions.

Figure 3-2. Carbon dioxide (CO2) emissions versus passenger cars for 51 countries, 2005

Two outliers in Figure 3-2 are the United States and Australia. Both have much higher carbon emissions than would be expected based on their numbers of cars. For the United States, this disconnect has a simple explanation: many Americans don't drive cars. They drive trucks and SUVs. These vehicles aren't included in the World Bank's "passenger cars" figures, but they certainly burn gasoline and produce carbon dioxide -- lots of it. For Australia, the explanation is more complicated, but Australia's high levels of carbon dioxide emissions are partly due to a heavy reliance on coal for electricity generation. Other countries that deviate from their expected levels of carbon emissions (Singapore, Kazakhstan) have their own stories. Overall, though, when countries have more cars they're likely to emit more carbon dioxide. This result is robust: removing Australia, the United States, Singapore, or Kazakhstan has little effect on the slope of the regression line. One interesting feature of Figure 3-2 is the expected value of carbon dioxide emissions when there are no cars in a country. This can be determined by finding zero on the passenger cars axis and reading up the graph until you hit the regression line. According to the regression line, when the number of cars is zero the expected level of carbon emissions is around 3 tons per capita. This implies that even if we gave up driving entirely, we would still have a problem with global warming. The reason is that there are many other sources of carbon emissions besides cars. We burn coal in power plants to generate electricity. We burn natural gas to heat our homes. Even without cars we would still have ships, trains, and airplanes burning oil. Solving global warming is going to be very difficult. A first step in solving global warming might be to give up driving cars. Giving up cars is not going to be easy. Cars are everywhere, and most of us drive every day. 
Over the past fifty years countries like the United States, Canada, and Australia have rebuilt themselves around the automobile. It's hard to get anywhere today without a car. The results presented in Figure 3-2 suggest that we should at least start to solve global warming by driving less. Reducing cars would reduce emissions a lot, even if it wouldn't reduce them to zero. Predictions about what would happen if we made changes in our lives can help us decide what kinds of changes to make. The formulation of social policies to fight problems like global warming requires us to make predictions like these. Social scientists try to answer questions about how the world will change in the future depending on the policies we put in place today. Regression models help us answer social policy questions like this. Regression models can also be used to predict things like people's incomes and voting behavior. Simple scatter plots may be useful for helping us understand the overall shape of the relationship between two variables, but regression models go much further in enabling us to make concrete predictions.

This chapter focuses on showing how linear regression models can be used to make predictions about the values of dependent variables. First, like any line a regression line has both a slope and an intercept (Section 3.1). Slopes were covered in Chapter 2, but intercepts also add important information about a line. Second, regression slopes and intercepts are both necessary to compute the expected values of a dependent variable (Section 3.2). Expected values can be used to make predictions about the dependent variable. Third, expected values can be used to predict values of the dependent variable even where data for those variables are missing for particular cases (Section 3.3). As might be expected, predictions made within the range of prior experience tend to be better than predictions about things that have never been observed before. An optional section (Section 3.4) shows how regression prediction can be used to compare different groups in society. Finally, this chapter ends with an applied case study of the relationship between the racial makeup of the population and presidential voting patterns in the 2008 election across the 50 states of the United States (Section 3.5). This case study illustrates how regression lines are drawn based on both their slopes and their intercepts, how the expected values of dependent variables are calculated, and how mean levels of variables can depend on the values of other variables. All of this chapter's key concepts are used in this case study. By the end of this chapter, you should be able to use the results of regression models to understand the determinants of real-world outcomes that are of interest to social scientists.

## 3.1. Slopes and intercepts

The most important feature of a regression line is usually its slope. In many situations, however, we also want to know the value of a regression line when the independent variable equals zero. In scatter plots like Figure 3-2 and Figure 3-3, the independent variable equals zero at the point where the regression line intercepts the dependent variable axis. Intercepts are the places where regression lines cross the dependent variable axis in a scatter plot. Intercepts can provide meaningful information for interpreting a relationship, as in Figure 3-2 and Figure 3-3, but they are also useful in their own right. If you know both the slope of a regression line and its intercept, you can draw the whole line and every point on it.

The use of a slope and an intercept to draw a regression line is illustrated in Figure 3-3. Figure 3-3 shows the regression line connecting passenger cars to carbon emissions from Figure 3-2, but the actual data points have been hidden to show just the line. The slope of the line is 0.0125, meaning that every 100 extra cars corresponds to a 1.25 ton increase in per capita emissions. The intercept is around 3. To keep all the calculations simple, we'll assume it is exactly 3.00. Starting from this regression intercept of 3.00, an additional 100 cars is associated with an increase of 1.25 in carbon emissions. So the first 100 cars result in carbon emissions of 3.00 + 1.25 = 4.25 tons per capita. Adding 100 more cars on top of these results in carbon emissions of 4.25 + 1.25 = 5.50 tons per capita, and so on. Starting from the intercept of 0 cars and 3.00 tons of carbon, we can draw the whole regression line point by point by using the slope.

Figure 3-3. Regression of carbon dioxide (CO2) emissions on passenger cars (from Figure 3-2)
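The point-by-point construction of the regression line can be sketched in a few lines of Python. This is only an illustration: the intercept (3.00) and slope (0.0125) come from the text, while the function name `expected_co2` and the choice to stop at 300 cars are assumptions made for the example.

```python
# Coefficients taken from the text (intercept assumed to be exactly 3.00).
INTERCEPT = 3.00   # tons of CO2 per capita when passenger cars = 0
SLOPE = 0.0125     # tons of CO2 per capita per additional car per 1000 people


def expected_co2(cars_per_1000):
    """Return the point on the regression line for a given level of cars."""
    return INTERCEPT + SLOPE * cars_per_1000


# Walk along the line in steps of 100 cars, starting from the intercept.
# Stopping at 300 cars is an arbitrary choice for this illustration.
line_points = [(cars, round(expected_co2(cars), 2)) for cars in range(0, 301, 100)]

for cars, co2 in line_points:
    print(f"{cars:3d} cars per 1000 people -> {co2:.2f} tons per capita")
```

Each step of 100 cars adds 1.25 tons, which reproduces the 3.00, 4.25, 5.50 sequence described above.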

It takes up much less space to give just the slope and intercept of a regression line than to graph the whole line on a scatter plot. The regression model graphed in Figure 3-2 and Figure 3-3 is summarized in a table in Figure 3-4. In a typical regression table, independent variables are listed in the first column, with the regression coefficients listed in the following column. Regression coefficients are the slopes and intercepts that define regression lines. In Figure 3-4, there is only one regression model (Model 1) and it only has two coefficients (an intercept and a slope). The intercept (3.00) is listed next to an entry called "[Constant]." The intercept is denoted by "[Constant]" in brackets because, although it's included in the variable list, it's not actually a variable. The terms "constant" and "intercept" are used interchangeably by social scientists.

Figure 3-4. Regression of carbon dioxide (CO2) emissions on passenger cars (tabular form)

The slope associated with the independent variable "Cars" (0.0125) is listed next to the entry for "Cars." If there were more independent variables, they would be listed in additional rows. Similarly, if there were more regression models, they would be listed in additional columns. Regression tables are especially convenient for reporting the results of several regression models at the same time. In Chapter 2, the percent of Australians who felt unsafe walking alone at night was regressed on state crime rates (Figure 2-2) and personal experiences of violence (Figure 2-4). Instead of using scatter plots, the results of these two regression analyses can be summarized compactly in a single table, as shown in Figure 3-5. All of the coefficients associated with both models are reported in this one table.

Figure 3-5. Regression models for the percent of Australians who feel unsafe walking alone at night

The table in Figure 3-5 shows that in Figure 2-2 the intercept was 8.34 and the slope was 3.20, while in Figure 2-4 the intercept was 3.39 and the slope was 1.37. With just this information, it would be possible to draw the regression lines from the two figures. This information also contains most of the important facts about the two regression lines. For example, we know that even if crime rates were zero in a given state, we would still expect 8.34% of the people in that state to feel unsafe walking alone at night. Similarly, even if no one in a state ever experienced violence personally, we would still expect 3.39% of the people in that state to feel unsafe walking alone at night. Since both slopes are positive, we know that both actual crime and people's experiences of violence make them feel more unsafe when they go out alone at night. To see the regression error and outliers associated with these two regression models we would need scatter plots, but the table of coefficients gives us the basics of the models themselves.

## 3.2. Calculating expected values

Tables of regression coefficients that include slopes and intercepts can also be used to compute the expected values. This should not be surprising, since slopes and intercepts are used to plot regression lines and expected values are just the values on the regression line. Returning to the relationship between passenger cars and carbon dioxide emissions, the slope is 0.0125 and the intercept is 3.00 (Figure 3-4). The slope and intercept define the regression line: the line starts at 3.00 tons of carbon emissions when the number of cars equals 0, then goes up by 0.0125 tons for every 1 additional car. An increase of 0.0125 for every car is the same as 1.25 for every 100 cars (Figure 3-5). As shown in Figure 3-3, the expected value of carbon emissions for 0 cars is 3.00 tons. For 100 cars the expected value is 4.25 tons. For 200 cars it is 5.50 tons. And so on.

Reading expected values off a chart like Figure 3-2 or Figure 3-3 is one way to find them, but a better way is to use the slope and intercept in an equation to compute them. For example, the equation to compute expected values of carbon emissions is depicted in Figure 3-6. This equation uses the slope and intercept for carbon emissions that were reported in Figure 3-4. These are the same slope and intercept that were also used in the scatter plots of carbon emissions versus passenger cars.

Figure 3-6. Equation to compute the expected values of carbon dioxide (CO2) emissions (from Figure 3-4)

Using this equation, it is possible to calculate the expected value of carbon emissions for any level of passenger cars. For example, the level of passenger cars in the United States is 461 cars per 1000 people. Using the equation presented in Figure 3-6, the expected value of carbon dioxide emissions in the United States = 3.00 + 0.0125 x 461 or 8.7625 metric tons per capita. Rounding off to one decimal place gives an expected value of carbon emissions in the United States of about 8.8 metric tons per capita. The actual value of carbon emissions in the United States, 19.5 metric tons, is obviously much higher than expected. As suggested above, this is because nearly half of all Americans drive SUVs and trucks, not cars.
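The United States calculation can be checked with a short Python sketch. The coefficients and the two U.S. figures are taken from the text; the variable names, and the idea of reporting the gap between the actual and expected values, are illustrative choices rather than part of the original analysis.

```python
# Coefficients of the regression of CO2 emissions on passenger cars (from the text).
INTERCEPT = 3.00
SLOPE = 0.0125

us_cars = 461          # passenger cars per 1000 people (from the text)
us_actual_co2 = 19.5   # actual metric tons of CO2 per capita (from the text)

# Expected value = intercept + slope x value of the independent variable.
us_expected_co2 = INTERCEPT + SLOPE * us_cars

# How far the actual value sits above the regression line.
us_gap = us_actual_co2 - us_expected_co2

print(f"Expected: {us_expected_co2:.4f} tons per capita")
print(f"Actual minus expected: {us_gap:.1f} tons per capita")
```

The gap of roughly 10.7 tons per capita is what makes the United States stand out as an outlier on the scatter plot.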

Figure 3-7. Expected and predicted values of carbon dioxide (CO2) emissions (from Figure 3-2)

Predicted values and expected values are very similar concepts. In fact, many people use the two terms to mean the same thing. The difference between them is really just a difference in intentions. The regression line plots the expected values of the dependent variable based on the actual observations of the independent variable. Predicted values are expected values that are used to make predictions about cases for which data do not exist. For example, in Chapter 1 when we were using state median income to study soft drink consumption across the United States we were missing soft drink data for Alaska and Hawaii. Both Alaska and Hawaii were missing data for the dependent variable. Data for the independent variable, state median income, are available for both states: \$60,945 for Alaska and \$65,146 for Hawaii. These income data can be combined with a regression model using data from the rest of the United States to predict soft drink consumption in Alaska and Hawaii. Figure 3-8 reports the results of a regression model with state median income as the independent variable and state per capita soft drink consumption as the dependent variable. The regression line in this model has an intercept of 93.9 and a slope of -0.60. This means that every \$1000 of additional income is associated with a decline of 0.60 gallons in the amount of soft drinks consumed. This regression line is the line that appears on the scatter plot in Figure 1-2. The equation for this line is Soft drink consumption = 93.9 - 0.60 x State median income in thousands of dollars.

Figure 3-8. Table of regression results for the regression of soft drink consumption on state median income (from Figure 1-2)

This equation can be used to calculate predicted values for soft drink consumption in Alaska and Hawaii. Alaska's state median income is approximately \$61,000 (rounding off to the nearest thousand to simplify the calculations). The level of soft drink consumption in Alaska predicted by the regression model is 93.9 - 0.60 x 61 = 57.3 gallons. Hawaii's state median income is approximately \$65,000 (again rounding to the nearest thousand). Going through the same process for Hawaii gives a predicted value of 93.9 - 0.60 x 65 = 54.9 gallons. The predicted values of soft drink consumption are depicted on the scatter plot of state income and soft drink consumption for the other 48 states and the District of Columbia in Figure 3-9. Alaska and Hawaii may not have exactly the levels of soft drink consumption plotted in Figure 3-9, but these predicted values are the best guesses we can make based on the data we have. They are predictions of how many gallons of soft drinks Alaskans and Hawaiians would be found to drink, if we had the data.

Figure 3-9. Predicted values of Alaska and Hawaii's soft drink consumption (from Figure 1-2; note that the regression intercept where income = \$0 falls over the left edge of the plot and is not depicted)
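The Alaska and Hawaii predictions can be reproduced with a minimal Python sketch. The intercept, slope, and rounded incomes come from the text; the function name `predicted_soft_drinks` and the dictionary layout are assumptions made for the example.

```python
# Coefficients of the regression of soft drink consumption on state median income
# (from the text). Income is measured in thousands of dollars.
INTERCEPT = 93.9   # gallons per capita when income = 0
SLOPE = -0.60      # change in gallons per additional $1000 of median income


def predicted_soft_drinks(median_income_thousands):
    """Return predicted per capita soft drink consumption in gallons."""
    return INTERCEPT + SLOPE * median_income_thousands


# State median incomes rounded to the nearest thousand, as in the text.
rounded_incomes = {"Alaska": 61, "Hawaii": 65}

predictions = {
    state: round(predicted_soft_drinks(income), 1)
    for state, income in rounded_incomes.items()
}

for state, gallons in predictions.items():
    print(f"{state}: {gallons} gallons per capita")
```

The same function works for any state whose median income is known, which is what makes prediction possible when the dependent variable is missing.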

Predicted values can be computed in two different situations. They can either be calculated for values that fall inside the range of the observed data or for values that fall outside the range of the observed data. Interpolation is the process of using a regression model to compute predicted values inside the range of the observed data. All of the predicted values calculated above -- carbon emissions in Canada, soft drinks in Alaska, and soft drinks in Hawaii -- are examples of interpolation. In all three cases, the values of the dependent variables were within the ranges of values that had already been observed for other cases in the analyses.

Sometimes, however, we want to make predictions outside of the values that have already been observed. Extrapolation is the process of using a regression model to compute predicted values outside the range of the observed data. For example, predicting how much carbon emissions there would be in a world with no passenger cars requires extrapolation. There are no countries in the world today that don't have passenger cars. Even Niger in West Africa has 4 passenger cars per 1000 people.

Social scientists are usually comfortable with interpolation but cautious about extrapolation. This is because the interpolation of predicted values is based on actual experiences that exist in the real world, but extrapolation is not. For example, we may not know Alaska and Hawaii's levels of soft drink consumption, but we do know the levels of other states with similar income levels. This information can be used to predict Alaska and Hawaii's levels with some confidence. On the other hand, we might hesitate to use the data graphed in Figure 3-9 to predict Puerto Rico's soft drink consumption. Puerto Rico's median income is only \$18,610. This is far outside the range of the available data. 
Using the equation of the regression line from Figure 3-8 to predict Puerto Rico's soft drink consumption would give a predicted value of about 82.7 gallons per capita, but most social scientists would not feel confident making such a prediction.
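The Puerto Rico figure can be verified with the same soft drink equation, this time using the unrounded income. This is a sketch for checking the arithmetic only; the variable names are illustrative, and nothing in the code can tell us whether the prediction deserves to be trusted.

```python
# Soft drink regression coefficients (from the text); income in thousands of dollars.
INTERCEPT = 93.9
SLOPE = -0.60

# Puerto Rico's median income of $18,610, expressed in thousands of dollars.
puerto_rico_income = 18.610

puerto_rico_prediction = INTERCEPT + SLOPE * puerto_rico_income

print(f"Predicted consumption: {puerto_rico_prediction:.1f} gallons per capita")
```

The equation returns a number for any income we feed it. The caution about extrapolation comes from knowing that no observed state has an income this low, not from anything in the calculation itself.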

## 3.4. Comparing populations using predicted values (optional/advanced)

In America, on average, women make less money than men and blacks make less money than whites. According to data on Americans aged 20-29 from the 2008 Survey of Income and Program Participation (SIPP), women earned \$4966 less than men and blacks earned \$6656 less than whites (on average). These figures are based on a random sample of 4964 full-time employed American twentysomethings in 2008. The data come from Wave 2 of the 2008 SIPP. Income here is defined as wage income (income earned through working a job, as opposed to making money through investments) and is calculated as twelve times the monthly income recorded in the SIPP. The gender gap in wage income is large, and the race gap is larger still.

These gender and race gaps in wage income may be due to discrimination, or they may be due to other causes. For example, it is possible that the white men who were given the SIPP survey happened to be older than the people in the other groups. If they were older, they would be expected to have higher incomes. The white men may also be different in other ways. They might have more experience, or more education. It is possible that a portion of the gender and race gaps can be explained by the specific characteristics of the particular people in the sample. In order to compare incomes fairly, it's important to compare like with like. Later chapters of this book will discuss how to "control for" confounding influences like age, education, and experience, but predicted values can also do the job in some situations. For example, predicted values can be used to predict what the incomes in each group would be if all the people were the same age.

People's incomes definitely rise with age, starting around age 20. Figure 3-10 reports the results of four regression models using age as the independent variable and wage income as the dependent variable: one for black females, one for black males, one for white females, and one for white males. Note that the intercepts are not very meaningful here. The intercept is the expected value of the dependent variable when the independent variable equals zero. In Figure 3-10, the intercept would represent people's expected wage incomes at age 0. Obviously, that's not a very meaningful concept. It's also an extreme extrapolation from the range of the observed data, which are based on people ages 20-29. In short, the intercepts in Figure 3-10 are just the places where the regression lines start from. They don't have any real meaning beyond that.

Figure 3-10. Table of regression results for the regression of wage income on age for employed SIPP subjects aged 20-29, by race and gender, 2008

The slopes of the regression models reported in Figure 3-10 contrast the effects of an extra year of age on people's wage incomes for different groups of people. For black women, every additional year of age yields, on average, an extra \$1421 of wage income. Black men don't get quite as much advantage from getting a year older, just \$1281. The big difference comes with white women and men. For white women, each year of age yields on average an extra \$2076 of wage income. The benefits of age for white men are even greater. For white men, every year of age yields on average an extra \$2830 of wage income. The expected payoff of an extra year's age for a white man is nearly twice as high as the average payoff for a black woman.

The coefficients of the four regression models reported in Figure 3-10 can be used to calculate predicted values for the wage incomes of black females, black males, white females, and white males at any given age. From Figure 3-10, the regression model for black women was Wage income = -7767 + 1421 x Age. For black women of various ages, this works out to:

• Age 25: Wage income = -7767 + 1421 x 25 = \$27,758
• Age 30: Wage income = -7767 + 1421 x 30 = \$34,863
• Age 40: Wage income = -7767 + 1421 x 40 = \$49,073

These figures are reported in Figure 3-11 in the column for black females. Figures for black men, white women, and white men are calculated in the same way. The prediction of wage income at age 25 for each group is an interpolation, since the ages of the SIPP participants in the study were 20-29. As an interpolation, it should be a pretty accurate estimate of what 25-year-olds in each category would be expected to earn. The prediction of wage income at age 30 is on the very edge between an interpolation and an extrapolation, so it might be less reliable. The prediction of wage income at age 40 is an extrapolation far out into the future, so far that most social scientists would not trust it at all. The age 40 extrapolation is included here just to illustrate how extrapolation works.
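The three predictions for black women, along with a simple check of whether each age falls inside the observed range, can be sketched as follows. The coefficients and the 20-29 age range come from the text; the function names and the strict inside-or-outside rule are assumptions made for this illustration. Under that strict rule age 30 is labeled an extrapolation, even though the text describes it as sitting on the very edge.

```python
# Regression of wage income on age for black women (coefficients from the text).
INTERCEPT = -7767   # dollars; expected wage income at age 0 (not meaningful)
SLOPE = 1421        # dollars of wage income per additional year of age

# Ages of the SIPP participants in the study (from the text).
OBSERVED_AGES = (20, 29)


def predicted_wage(age):
    """Return predicted annual wage income in dollars at a given age."""
    return INTERCEPT + SLOPE * age


def prediction_type(age, observed=OBSERVED_AGES):
    """Label a prediction by whether the age lies inside the observed range."""
    low, high = observed
    return "interpolation" if low <= age <= high else "extrapolation"


results = {age: (predicted_wage(age), prediction_type(age)) for age in (25, 30, 40)}

for age, (wage, kind) in results.items():
    print(f"Age {age}: ${wage:,} ({kind})")
```

A label like this only records where the age sits relative to the data. How much to trust a prediction near or beyond the edge of the range remains a judgment call.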

Figure 3-11. Table of predicted values of income by age based on SIPP data, by race and gender, 2008

What do the models tell us about discrimination? In the SIPP data overall, the income gap between women and men in their twenties is \$4966, while the income gap between blacks and whites is \$6656. Comparing the predicted incomes of people at age 25, the predicted income for black women is \$3082 less than that for black men, while the predicted income for white women is \$3994 less than that for white men. This means that taking into account race and age, 25-year-old women earn something like \$3000-\$4000 less than men, not close to \$5000 as indicated by the raw data. Similarly, the predicted income at age 25 for black women is \$4855 less than that for white women, while the predicted income for black men is \$5757 less than that for white men. Again, the differences adjusted for age and sex are large, but not as large as the raw race gap of \$6656. At age 25, the gender and race gaps in wage income are large, but not as large as might have been thought based just on the raw data.

Figure 3-12. Vote for Barack Obama versus state percent black, 2008

How is it possible that there was no relationship between the number of blacks in a state and the vote for Obama in that state, given that 96% of black Americans voted for Barack Obama? The answer is that in many states with large numbers of blacks, whites voted overwhelmingly for his opponent, John McCain. This trend was particularly pronounced in the South. The historical center of the struggle for civil rights for black Americans has always been the South, in particular the 11 states of the former Confederacy that seceded from the United States during the Civil War (1861-1865). The 11 Confederate states were strongly committed to continuing the institution of slavery, and after being readmitted to the Union they put in place policies and laws that discriminated heavily against their black citizens. Black Americans have suffered discrimination everywhere in the United States, but the levels of discrimination in the 11 former Confederate states have historically been much worse than elsewhere. Figure 3-13 plots exactly the same data as Figure 3-12, but divides the states into the 39 "free" states that were never part of the Confederacy versus the 11 former Confederate states that seceded from the United States during the Civil War. The free states are marked with diamonds and the former Confederate states are marked with X's. Separate regression lines have been plotted for the two groups of states. Among the 39 free states, states with higher black populations returned higher votes for Obama, as would be expected. Among the 11 former Confederate states, states with higher black populations actually returned lower votes for Obama.

Figure 3-13. Vote for Barack Obama versus state percent black, separating free states and former Confederate states, 2008

Figure 3-14 summarizes the regression coefficients for the lines plotted in Figure 3-12 and Figure 3-13. The line plotted in Figure 3-12 for all states is Model 1 in Figure 3-14. The free state line plotted in Figure 3-13 is Model 2 and the former Confederate state line plotted in Figure 3-13 is Model 3. The number of cases (N) for each model has been noted in the table. In Model 1, the intercept is 51.1 and the slope is -0.057. The intercept of 51.1 means that the predicted value of the Obama vote for a state with no black voters would be 51.1%. This is an extrapolation, since there are no states that actually have 0% black populations. In general, extrapolations are less reliable than interpolations, but in this case the extrapolation is very slight, since several states have black populations under 1%.

Figure 3-14. Table of regression results for regression models predicting the Obama vote by state characteristics

The slope in Model 1 is -0.057. This means that for every 1% rise in a state's black population, the Obama vote would be expected to fall by 0.057%. This is a very, very slight downward slope. The number of blacks in a state has essentially no effect on that state's total vote for Obama. Excluding the former Confederate states, the free states model (Model 2) has an intercept of 48.1. This means that Model 2 would predict a vote for Obama of 48.1% in a state that had no black voters. This is different from the prediction of Model 1, but not very different. Both predictions (51.1% from Model 1 and 48.1% from Model 2) are within the range of actual votes for Obama in states that have very small numbers of black voters, like Vermont and Wyoming. More interesting is the slope of Model 2. Focusing on just the 39 free states, the slope of the regression line is clearly positive. For the 39 free states, every 1% rise in a state's black population is associated with a 0.576% increase in the vote for Obama. This is a large effect. An increase of one point in the black population predicts an increase of half a point in the vote for Obama. Model 3 repeats the regression of the Obama vote on state percent black, but this time using only the 11 Southern states that were historically part of the Confederacy that seceded from the United States during the Civil War (1861-1865). Among the former Confederate states, the predicted value of the vote for Obama in a state with no black voters would be 47.3. This prediction is an extrapolation far outside the observed range of the numbers of black voters in these states, but it is still a credible figure. It is a little lower than the equivalent predictions from Model 1 and Model 2, but not much lower, and it is within the range of actually observed votes for Obama in the free states with small black populations. The more important coefficient in Model 3 is the slope. The slope is -0.114. 
This means that among the 11 former Confederate states, an increase of 1% in the percent of the population that is black is associated with a decline of 0.114% in the vote for Obama. Each one point increase in the black population predicts a decline of just over a tenth of a point in the vote for Obama. This is striking. Outside the South, the more blacks there were in a state, the more people voted for Obama. In the South, the more blacks there were in a state, the more people voted for McCain. High votes for John McCain are not evidence of racism. There is no reason to think that a 67.5% vote for McCain in Wyoming is evidence of racism in Wyoming. But in the states that have the worst history of racism -- and only in those states -- the vote for John McCain was strongest in the states that had the most black citizens. In other words, Southern whites were more likely to vote for McCain if they had black neighbors. If there were fewer blacks in a state, whites were more comfortable voting for Obama, but if there were more blacks in a state, whites tended to vote for McCain. This is very strong circumstantial evidence of a legacy of racism in those states. Further research would be necessary to more fully understand these voting patterns, but the regression models reported in Figure 3-14 do raise serious questions about race and racism in America today.
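The contrast among the three models can be made concrete by computing predicted values from each set of coefficients. The intercepts and slopes below are those reported in the text; the function name, the dictionary keys, and the comparison value of a 10% black population are hypothetical choices made only for illustration.

```python
# (intercept, slope) pairs for the three models described in the text.
MODELS = {
    "Model 1 (all states)": (51.1, -0.057),
    "Model 2 (free states)": (48.1, 0.576),
    "Model 3 (former Confederate states)": (47.3, -0.114),
}


def predicted_obama_vote(model_name, percent_black):
    """Return the predicted Obama vote (%) for a given state percent black."""
    intercept, slope = MODELS[model_name]
    return intercept + slope * percent_black


# Hypothetical comparison point chosen for illustration: a state that is 10% black.
EXAMPLE_PERCENT_BLACK = 10

predictions_at_example = {
    name: round(predicted_obama_vote(name, EXAMPLE_PERCENT_BLACK), 2)
    for name in MODELS
}

for name, vote in predictions_at_example.items():
    print(f"{name}: {vote}% predicted vote for Obama")
```

At zero percent black each model simply returns its intercept. As the black share of the population rises, the Model 2 prediction climbs while the Model 3 prediction falls, which is the pattern described above.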

## Chapter 3 Key Terms

• Extrapolation is the process of using a regression model to compute predicted values outside the range of the observed data.
• Intercepts are the places where regression lines cross the dependent variable axis in a scatter plot.
• Interpolation is the process of using a regression model to compute predicted values inside the range of the observed data.
• Predicted values are expected values of a dependent variable that correspond to selected values of the independent variable.
• Regression coefficients are the slopes and intercepts that define regression lines.