Artificial Intelligence for Computational Sustainability: A Lab Companion/Machine Learning for Prediction

From Wikibooks, open books for an open world
Overview

Machine learning for predicting properties of objects and events, as opposed to machine learning for improving search, planning, and problem solving, is the dominant form of machine learning studied (though the latter is often usefully understood in terms of the former). AI textbooks with substantive machine learning content, particularly of the type described in this chapter of the lab companion, include Russell and Norvig, 2010[1], Part V; Poole and Mackworth, 2010[2], Chapter 7 and Sections 11.1-11.2; and Luger, 2009[3], Chapters 10-11. In addition, there are several textbooks on machine learning (Hastie et al., 2009[4]; Langley, 1996[5]; Mitchell, 1997[6]) and other online resources, such as Ng's (2011) online videos[7].

Supervised Learning: Classification

Classification and Species Distribution

Lab: Species Distribution Modeling Using Maximum Entropy[8]

The distribution of each species is determined by a combination of factors, including climate, resources, and dependence on other species. This unique combination of factors determines where different species can live successfully. Even if a species could survive in a particular climate and habitat, it may not have the resources it needs to persist or reproduce there.

Consider the example of Joshua trees, which are confined to elevations between 400 and 1800 m (roughly 1,300-5,900 ft.) in the Mojave Desert. For Joshua trees to grow any lower than 400 m or any further south than the Mojave Desert would be suicide by drought. However, they grew lower in elevation and further south during the cooler and wetter climate of the last glacial period. This means that Joshua trees expand their range when they can. So why don’t they live in coastal southern California? If a Joshua tree could take a coastal vacation, it would likely find the climate to be ideal for growth. However, it would never reproduce. To reproduce, Joshua trees depend on a variety of yucca moth that is genetically programmed to stuff a little ball of pollen into the cup-shaped stigma of Joshua tree flowers. This relationship is vital for both plant and moth, and for a complex set of reasons that are not fully understood, the Mojave Desert is where these two species live together.

In this lab, you will examine the effects of climate and climate change on the distributions of several species of tree, and then use climate and species-range data to construct computational models of species distribution using maximum entropy modeling (also known as Maxent)[9][10][11].

Maxent is a general method from information theory for finding the probability distribution that has maximum entropy (i.e., is the most non-committal, or closest to a uniform distribution), subject to a set of constraints that represent our partial knowledge of the target distribution. In this case, we have partial knowledge of the species' presence at specific points over the map; these known sample points serve as the constraints on the probability distributions. The goal of Maxent is to generalize these samples, following the principle of maximum entropy, to estimate the species distribution over the entire map. Each location on the map, including the known samples, is characterized by a set of climate variables, such as mean annual temperature, mean diurnal temperature range, mean precipitation during the coldest quarter of the year, etc. The species distribution is learned in this multi-dimensional feature space.
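The principle described above can be illustrated with a toy calculation. The following is a minimal sketch, not the algorithm used by the Maxent software: it assumes a map with only six cells, a single climate feature rescaled to the range 0 to 1, and one constraint (the model's expected feature value must equal the average over the presence samples). Under such a constraint, the maximum entropy distribution has the exponential form p(x) proportional to exp(lam * f(x)), so fitting reduces to finding lam. All data values and function names are hypothetical.

```python
import math

def gibbs_distribution(features, lam):
    """Distribution over map cells with p(x) proportional to exp(lam * f(x))."""
    weights = [math.exp(lam * f) for f in features]
    total = sum(weights)
    return [w / total for w in weights]

def expected_feature(features, probs):
    """Expected value of the feature under the distribution probs."""
    return sum(f * p for f, p in zip(features, probs))

def fit_maxent_one_feature(features, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Find lam by bisection; the expected feature value increases with lam."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        probs = gibbs_distribution(features, mid)
        if expected_feature(features, probs) < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical data: one rescaled climate value per map cell.
cell_temperature = [0.1, 0.2, 0.4, 0.5, 0.7, 0.9]
# Hypothetical presence records, given as cell indices (a cell may repeat).
presence_cells = [3, 4, 4, 5]

# Constraint: match the average feature value seen at the presence samples.
target = sum(cell_temperature[i] for i in presence_cells) / len(presence_cells)
lam = fit_maxent_one_feature(cell_temperature, target)
probs = gibbs_distribution(cell_temperature, lam)

for temp, p in zip(cell_temperature, probs):
    print(f"temperature={temp:.1f}  probability={p:.3f}")
```

With no constraint (lam = 0) the result is the uniform distribution; the constraint pulls probability toward warmer cells only as far as the samples require. The real software combines many feature types and adds regularization, as described in the cited papers.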

This lab is designed to augment in-class discussions on Maxent and species distribution modeling. It can be divided over multiple weeks based on the sections below.

Getting Started

To get started, download this zip file containing all data files needed for this lab. The zip file is approximately 74 MB and requires approximately 400 MB when uncompressed. It includes the species presence data, environmental data, a PDF of the climate maps, and a PDF describing each of the climate variables.

The Maxent software must be downloaded separately from http://www.cs.princeton.edu/~schapire/maxent/. These instructions were written for version 3.3.3k of the Maxent software, and so we recommend downloading that same version of the software. These instructions may need to be adapted for subsequent versions.

The directory structure of the zip archive, showing all key files and a complete installation of Maxent into the maxent directory.

Once you uncompress the zip file, you should have the following directories:

  • environmentBaseTemp
  • environmentIncrTemp
  • maxent
  • speciesData

and two files:

  • climate-maps.pdf
  • variables.pdf

Install the Maxent software to the maxent directory. Inside the maxent directory, you should now have four files:

  • instructions.txt
  • maxent.bat
  • maxent.jar
  • readme.txt

You are now ready to continue with the rest of the lab.

Examining Species Distributions

Examine the maps of California in climate-maps.pdf, which is available in the zip file downloaded in the Getting Started section. Each map depicts a single climate variable. Overlaid on each climate map are maps of six species’ ranges: bigcone Douglas fir (Pseudotsuga macrocarpa), Bishop pine (Pinus muricata), blue oak (Quercus douglasii), Jeffrey pine (Pinus jeffreyi), coast redwood (Sequoia sempervirens), and giant sequoia (Sequoiadendron giganteum).

After studying the maps, answer the following questions:

  1. Examine the first map (BIO1: Annual Average Temperature). Which species appears to survive best in cold temperatures?
  2. The third map (BIO3: Isothermality) compares the day-to-night temperature oscillation versus the summer-to-winter temperature oscillation. A value of 100 would represent a site where the diurnal temperature range is equal to the annual temperature range. A value of 50 would indicate a location where the diurnal temperature range is half of the annual temperature range.
    1. Which region has the highest isothermality (same temperature range)?
    2. What is a species that appears to grow well in a highly isothermal environment?
    3. What is a species that grows across a range of isothermalities?
  3. Examine all of the 20 maps and choose two species to focus on. Study the distribution of both species as they relate to the various climate variables.
    1. What two species did you choose?
    2. Do the species you chose appear to be confined to regions with cool summers?
    3. Does a spatial pattern in annual rainfall appear to correspond with the boundaries of any species’ range? Or is rainfall only important during a specific time of year?
    4. For each of your two species, what are the two climate variables that you hypothesize to be the most important in dictating that species’ distribution? Why have you chosen each climate variable?
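
The isothermality values described in question 2 can be reproduced with a one-line ratio. The sketch below assumes both ranges are given in the same temperature units; the function name and the sample numbers are hypothetical.

```python
def isothermality(mean_diurnal_range, annual_range):
    """Diurnal temperature range as a percentage of the annual temperature range."""
    if annual_range <= 0:
        raise ValueError("annual_range must be positive")
    return 100.0 * mean_diurnal_range / annual_range

# Diurnal range equal to the annual range gives 100.
print(isothermality(12.0, 12.0))
# Diurnal range half of the annual range gives 50.
print(isothermality(12.0, 24.0))
```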

Learning a Computational Model of Species Distributions Using Maxent

The Maxent software for species distribution modeling was developed in 2004 in a collaboration between machine learning researchers and a biologist (emphasizing the interdisciplinary nature of computational sustainability). It is a recent contribution from computer science and artificial intelligence that is now widely used by biologists and ecologists.

To learn the species distribution models, Maxent takes two inputs: (1) a file containing exact locations where a species of interest is known to grow and (2) a file containing climate data for each of those locations. By evaluating the climate data at each location where the species of interest is present, Maxent calculates a probability function that describes the chances of a tree location having any given climate setting. So if we were studying Joshua trees, Maxent would predict that if a Joshua tree is growing in a given location, there is a high probability that that location is hot rather than cold during summer. Next, Maxent flips this probability function around to predict the probability of species presence given a particular climate type. Therefore, Maxent would predict a high likelihood of Joshua tree presence in locations that are hot during summer and a low likelihood of presence in locations that are cold during summer. While this example focused on only one climate variable, Maxent generates the model and predicts the presence likelihood using multiple climate variables. Details on precisely how Maxent learns the model are given in the journal article by Phillips, Anderson, and Schapire (2006)[12].
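
The "flip" described above corresponds to Bayes' rule: P(presence | climate) = P(climate | presence) * P(presence) / P(climate). The following is a minimal sketch with a single two-valued climate variable, not a description of the software's internals. The counts are hypothetical, and the overall prevalence P(presence) is an assumed input, since presence-only records do not determine it.

```python
from collections import Counter

def presence_given_climate(presence_climates, background_climates, prevalence):
    """Apply Bayes' rule to obtain P(presence | climate) for each climate type.

    presence_climates:   climate type at each location where the species was seen
    background_climates: climate type at every cell of the map
    prevalence:          assumed overall P(presence), a hypothetical input
    """
    presence_counts = Counter(presence_climates)
    background_counts = Counter(background_climates)
    n_presence = len(presence_climates)
    n_background = len(background_climates)

    posterior = {}
    for climate, count in background_counts.items():
        p_climate_given_presence = presence_counts[climate] / n_presence
        p_climate = count / n_background
        posterior[climate] = p_climate_given_presence * prevalence / p_climate
    return posterior

# Hypothetical map: 30 hot-summer cells and 70 cold-summer cells.
background = ["hot"] * 30 + ["cold"] * 70
# Hypothetical sightings: 9 in hot-summer cells and 1 in a cold-summer cell.
sightings = ["hot"] * 9 + ["cold"] * 1

posterior = presence_given_climate(sightings, background, prevalence=0.1)
for climate, p in posterior.items():
    print(f"P(presence | {climate}) = {p:.3f}")
```

Because most sightings fall in the less common hot-summer cells, the computed probability of presence is much higher for hot-summer cells than for cold-summer cells, matching the Joshua tree example.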

Instructions to download and install the Maxent software are included in the Getting Started section. The data required by Maxent is included in the following three folders in the zip file:

  • environmentBaseTemp: the 20 climate parameters depicted in climate-maps.pdf
  • environmentIncrTemp: the 20 climate parameters, but with a uniform increase of 4°C
  • speciesData: the species presence locations

The file variables.pdf contains textual descriptions of each of the climate parameters.

To learn a computational model for the distribution of each of your two species: (Read these directions completely before you build your first model!)

  1. Run the Maxent program: java -Xmx512M -jar maxent/maxent.jar (Note that this example assumes you installed Maxent as specified in the Getting Started section.)
  2. Input the file containing the presence locations for your first species. Load this into the "Samples" section of the Maxent program.
  3. Load the climate parameters into the "Environmental Layers" section of the Maxent program by selecting the environmentBaseTemp folder.
  4. For this first species, you identified two climate parameters as potentially important in determining the species' distribution. Choose one of these climate parameters and select only the environmental layer corresponding to that parameter. (Hint: use the "Deselect All" button to make the process go quicker.)
  5. Select the options for "Create response curves" and "Make pictures of predictions."
  6. Create a new folder for the Maxent output in your own directory space, named according to the species name and environmental variables you're testing. (E.g., jeffpine_annualprecip) Select this folder for the "Output Directory."
  7. Run the model.
  8. Repeat steps 2-7 for each of your two species, testing only one climate variable each time. At the end of this step, you should have four output models (two for each of your two species).

Each output folder will contain an .html webpage that summarizes the model's information, including the predicted species distribution overlaid on a map and several performance curves, as shown in the figures below. Cooler colors (blue/green) indicate areas where the model calculates a low probability of species presence and warmer colors (red/yellow) indicate areas where the model calculates a higher probability of species presence. White squares indicate the locations specified in your species presence file. For the response curve (middle figure), the x-axis represents a range of climate values (in this case the annual precipitation in mm) and the y-axis indicates the probability of finding the species of interest in an area with any given annual precipitation. So, the response curve below indicates that Jeffrey pine trees are most likely present in areas with an annual precipitation greater than 600 mm.

Figure (left): Predicted distribution of Jeffrey pine trees from Maxent modeling.
Figure (middle): Response curve of the Maxent model of Jeffrey pine trees in relation to annual precipitation.
Figure (right): Corresponding ROC curve of the Maxent model of Jeffrey pine trees in relation to annual precipitation.

The rightmost figure depicts the receiver operating characteristic (ROC) curve for the model. The ROC curve depicts the classification performance of the model under different discrimination thresholds. The plot can be turned into a single summary statistic by taking the area under the ROC curve (AUC), but note that the AUC loses information about the tradeoffs in the model's performance in different regions.[13] Notice that the ROC curve lies in the unit square, so a model with perfect (100%) accuracy would have the red line go all the way to the upper left (coordinates (0,1)) and would have area 1 (although this is seldom achieved). The black line indicates the performance of random guessing, and so it has 50% accuracy and an AUC of 0.5. If the red line is below (to the right of) the black line, this indicates that we could have done better simply by random guessing. The AUC for this example is 0.86 as noted in the graph's legend (this is called the "training AUC" of the model). Note that this does NOT guarantee that the model will have an AUC of 0.86 for unseen data; in fact, the training accuracy is often a poor indication of general model performance (called "test accuracy" or "generalization accuracy").
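
To make the construction of the ROC curve and the AUC concrete, here is a minimal sketch that sweeps the discrimination threshold over a list of model scores and integrates the resulting curve with the trapezoid rule. The scores and labels are hypothetical and are not output from the Maxent software.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    labels are 1 for positive examples and 0 for negative examples.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical model scores and true labels for six locations.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(curve)
print(f"AUC = {auc(curve):.3f}")
```

A model that ranks every positive above every negative yields an AUC of 1, while a model that gives every location the same score yields the diagonal line and an AUC of 0.5. Computing the AUC on held-out locations rather than the training locations gives a better indication of generalization.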

To complete your analysis:

  1. Paste and label all four maps, response curves, and ROC curves in your lab write-up. For each response curve, construct a sentence or two that describes the plotted relationship.
  2. Where is one area where each model overpredicted the probability of species presence? Why do you think this occurred?
  3. What do the ROC curves tell you about each of these models? Explain in a sentence or two.
  4. For each species, what is at least one other climate variable that you think you could add to improve the model’s performance? What would a successful model look like?

Improving the Model by Using Multiple Climate Parameters

In this section, we will build predictive models that combine information from multiple climate parameters to make stronger predictions.

  1. For each of your two species, now use multiple climate parameters to run a new model. Use your original two choices and any other climate variables that you identified in the last question of the previous section. When re-running your model, also check the box on the Maxent screen labeled “Do jackknife to measure variable importance.” Be sure to create two new output folders within your working directory.
  2. Re-evaluate the two new maps of predicted species presence. Label and paste them into your lab write-up.
    1. Did incorporating more climate variables improve the model's performance? If so, where and why?
    2. Where does the model still seem to be inaccurate? Why do you think this is?

Look at the response curves for each of your two new models. Note that the top and bottom rows of response curves look different even though they represent the same climate variables. The bottom curves represent what each response curve would look like if it were the only variable used to predict the probability of species presence. The top curves show the actual relationships between all climate variables and species presence in your new model. The multivariate model may indicate a wider response range for a variable than was discovered using that variable alone.

Pick the single species and distribution model that interest you the most. Label and paste all response curves (top and bottom) from that new model into your lab write-up.

  1. What do the response curves from your chosen model tell you about the climate constraints on your species?
  2. Do any of the variables function very differently in this multi-variable model than they would alone (are any top curves very different than their counterparts on the bottom)? Why do you think this is?

Examining the Effects of Climate Change on Species Distributions

Much of the western United States became warmer during the 1900s. This warming is expected to continue for many years to come as a result of an increase in the amount of long-wave radiation emitted towards the ground by greenhouse gas molecules like CO2, CH4, and H2O. This is likely to affect forests substantially. Species living in hot, dry regions are likely to suffer as evapotranspiration rates (and thus drought) increase. Species living in cold regions may benefit as warmer temperatures may allow for photosynthesis earlier in the spring and later in the fall.

These changes are likely to impact forests most substantially at their boundaries, where trees stand on the front lines of a constant battle between survival and death. If temperatures warm, new seedlings and mature trees growing at the upper elevation tree line in the Sierra Nevada will die less often and the upper tree line will rise. If evapotranspiration increases, new seedlings and mature trees growing on the lower elevation tree line between alpine forest above and desert scrub below will die more often and the lower tree line will also rise. This is how the edges of populations move when climate changes.

In this section, you will use the last multivariable model that you created in the previous section, and apply it to new climate data that assume a hypothetical change of 4°C. While real temperature change will be very spatially, seasonally, and diurnally variable (warming should be most substantial near poles, during winter, and at night), this hypothetical temperature change is applied everywhere at all times. So, we are assuming that diurnal temperature range and annual temperature range are unchanged. We also assume that rainfall is unchanged.
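
The assumption that a uniform shift leaves the temperature ranges unchanged can be checked numerically. The sketch below uses hypothetical monthly maximum and minimum temperatures for a single map cell. The two range definitions (mean of monthly max minus min, and warmest monthly maximum minus coldest monthly minimum) are assumptions made for illustration; the definitions used by the lab data are given in variables.pdf.

```python
def mean(values):
    return sum(values) / len(values)

def mean_diurnal_range(tmax, tmin):
    """Average of the monthly (maximum - minimum) temperature differences."""
    return mean([hi - lo for hi, lo in zip(tmax, tmin)])

def annual_range(tmax, tmin):
    """Warmest monthly maximum minus coldest monthly minimum."""
    return max(tmax) - min(tmin)

def warm(temps, delta=4.0):
    """Apply the same temperature increase to every value."""
    return [t + delta for t in temps]

# Hypothetical monthly temperatures (degrees C) for one map cell.
tmax = [12.0, 14.0, 17.0, 21.0, 26.0, 31.0, 35.0, 34.0, 30.0, 24.0, 17.0, 12.0]
tmin = [1.0, 3.0, 5.0, 8.0, 12.0, 16.0, 19.0, 18.0, 15.0, 10.0, 5.0, 1.0]

warm_tmax = warm(tmax)
warm_tmin = warm(tmin)

print("Mean maximum:  ", mean(tmax), "->", mean(warm_tmax))
print("Diurnal range: ", mean_diurnal_range(tmax, tmin), "->",
      mean_diurnal_range(warm_tmax, warm_tmin))
print("Annual range:  ", annual_range(tmax, tmin), "->",
      annual_range(warm_tmax, warm_tmin))
```

Only variables that depend on absolute temperature shift; variables built from temperature differences stay the same, because the added constant cancels.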

  1. Set up MaxEnt to run the same model as the last multi-variable model you created above. However, before running the model, use the Browse button next to “Projection layers directory/file” to select the environmentIncrTemp folder.
  2. Run the model, remembering to create a new output folder.
  3. Open the .html file in your output folder and scroll to the maps. The top map should be identical to the map produced by your last model run. Check to make sure it is. The bottom map shows the probabilities for species presence given the hypothetical warming of 4°C. Include this map in your lab write-up, placing it side-by-side with a map of the predictions for the current temperature. Label each map clearly.
  4. How do your predictions of species presence in a warmer climate differ from those in the current climate? Where are the regions where your species is no longer predicted to grow? Why do you think this is?
  5. Are there areas where the probability of species presence has increased? Where are they and why do you think this happened?

Instructor Summary for Species Distribution Modeling Lab

Classification and Species Identification

Example: OSU work on identifying arthropods

Supervised Learning: Regression

Regression is the problem of learning to predict an object's value along a continuous-valued dependent dimension (or variable) given the object's description along independent dimensions (or variables). Material on regression sufficient to complete many of the assignments in this section can be found in a variety of sources, including Russell and Norvig, 2010[1], section 18.6, pp. 717-723; Poole and Mackworth, 2010[2], section 7.3.2, pp. 304-305; and Ng, 2011[7], videos II, IV. This material focuses almost entirely on linear regression, though that is sufficient for some assignments in this section. Few undergraduate textbooks go significantly into other forms of regression, such as polynomial regression and tree-structured regression, but these texts typically provide ample material for instructors and the lab text to get into these issues, perhaps with pointers to other online content created in response to the lab text’s coverage. For example, polynomial regression is realized by adding higher-order terms to the data and then using the machinery of linear regression.
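
As an illustration of the last point, the following minimal sketch expands a single input x into the terms 1, x, x^2, ..., and then fits ordinary least squares on the expanded data by solving the normal equations. It uses only the Python standard library; the data are synthetic and all function names are hypothetical. A numerical library would normally replace the hand-written solver.

```python
def expand(x, degree):
    """Map a scalar x to the feature vector [1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]

def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - known) / m[r][r]
    return w

def fit_polynomial(xs, ys, degree):
    """Linear least squares on the expanded features (normal equations)."""
    rows = [expand(x, degree) for x in xs]
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, x):
    """Evaluate the fitted polynomial at x."""
    features = expand(x, len(weights) - 1)
    return sum(w * f for w, f in zip(weights, features))

# Synthetic data generated from y = 1 + 2x - 0.5x^2 (no noise).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x - 0.5 * x * x for x in xs]

weights = fit_polynomial(xs, ys, degree=2)
print("Fitted coefficients:", [round(w, 6) for w in weights])
print("Prediction at x = 1.5:", round(predict(weights, 1.5), 6))
```

The model is still linear in its weights, which is why the linear regression machinery applies unchanged; only the input representation has grown.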

A second form, regression tree induction, is typically mentioned in AI textbooks as a variant of decision tree induction, which universally gets substantial coverage. An important aspect of both regression and decision trees is that they make explicit the principle of "context" through the strategy of recursive decomposition: some variables may be informative in some contexts (e.g., subtrees) but not in others. Including regression tree analysis, however, would undoubtedly require more tutorial information. The subject of regression trees is a good example of purely AI educational content that might be created in Wikipedia as a result of the work on the sustainability lab text in Wikibooks; there is currently no “regression tree” article in Wikipedia, a remarkable omission, though there is a reference to regression trees from Wikipedia’s “Decision Trees (for machine learning)” article. Repeating a desire expressed in Artificial Intelligence for Computational Sustainability: A Lab Companion/Introduction and Artificial Intelligence for Computational Sustainability: A Lab Companion/Guide for Contributors, editors might be motivated to add Regression Tree pages to Wikipedia as a result of this lab's activity.
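
The idea of recursive decomposition and "context" can be seen in a very small regression tree. The sketch below is one simple variant, assumed here for illustration: it chooses splits that minimize the summed squared error and predicts the mean target value at each leaf. The data set is hypothetical and constructed so that the second variable matters only when the first variable equals 1.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets):
    """Return the (feature, threshold) that most reduces squared error, or None."""
    best = None
    best_err = sse(targets)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            err = sse(left) + sse(right)
            if err < best_err - 1e-12:
                best, best_err = (f, t), err
    return best

def build(rows, targets, depth=0, max_depth=3, min_size=2):
    """Recursively split the data; a leaf is stored as the mean target value."""
    if depth >= max_depth or len(rows) < min_size:
        return mean(targets)
    split = best_split(rows, targets)
    if split is None:
        return mean(targets)
    f, t = split
    left = [(r, y) for r, y in zip(rows, targets) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, targets) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build([r for r, _ in left], [y for _, y in left],
                      depth + 1, max_depth, min_size),
        "right": build([r for r, _ in right], [y for _, y in right],
                       depth + 1, max_depth, min_size),
    }

def predict(tree, row):
    """Follow the splits from the root until a leaf value is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical data: [site type (0 or 1), temperature]; temperature only
# affects the target when the site type is 1.
rows = [[0, 10], [0, 20], [0, 30], [1, 10], [1, 20], [1, 30]]
targets = [5.0, 5.0, 5.0, 1.0, 2.0, 3.0]

tree = build(rows, targets)
print(tree)
print(predict(tree, [0, 25]), predict(tree, [1, 30]))
```

In the learned tree, the branch for site type 0 is a single leaf, while the branch for site type 1 is split further on temperature: temperature is informative in one context but not in the other.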

Regression and Ecological Footprints

Quantitative measures of environmental impact enable computational and mathematical analyses of sustainability problems. Further, these measures facilitate visualization and other modalities for communicating environmental consequences to the public, policy makers, and scientists. Formally, an ecological footprint (Wackernagel and Rees 1996; Global Footprint Network, 2012) is the amount of land (e.g., in hectares) that is needed to sustain indefinitely, without degradation, a process or entity, ranging in scale from (manufacture, use, and disposal of) individual artifacts to cities, nations, and the world’s human population. Informally, the term “ecological footprint” refers to many kinds of measures, such as the greenhouse-gas and energy equivalence of a process or thing. Because an ecological footprint is typically a continuous value, regression may be usable to learn an effective predictor.

Regression and Species Distribution

Unsupervised Learning

In supervised learning there is (typically) one attribute or variable, called the dependent variable, that is the focus of attention; the goal of supervised learning from labeled data is to optimize prediction performance on this one dependent variable given some or all of the values of the remaining independent variables. In contrast, unsupervised learning can be cast as a problem in which no one variable is the exclusive focus of attention. Instead, a system might be called upon to make predictions along any variables with unknown values, given known values for other variables. In this case, the goal of unsupervised learning is to optimize some composite prediction performance (undoubtedly with many interesting variants) across all (or perhaps selected) variables. This performance task is perhaps most obvious as the goal in unsupervised learning of (Bayesian) belief networks, but it can equally be regarded as a goal of clustering and association rule learning[14]. This unsupervised performance task, as elaborated in the context of clustering for flexible prediction or pattern completion[15][16], is a precursor to multitask learning as described by Caruana (1997)[17]. There are other forms of unsupervised learning, most notably topic modeling[18], that can also be viewed along the lines of prediction performance.

Importantly, the treatments and presentations of unsupervised learning are generally much more varied and less predictable than the treatments of supervised learning, probably for a variety of reasons. Nonetheless, the unsupervised performance task as outlined here is a simple unifying theme across unsupervised methods and the reason that we place unsupervised learning under Machine Learning for Prediction.

Unsupervised Learning and Ecological Footprints

Unsupervised learning, such as belief network learning and clustering, can be used to discover, represent, and exploit statistical relationships between features and objects (e.g., people, processes, artifacts) for purposes of contextualizing and predicting ecological footprints.

Unsupervised Learning and Ecological Modeling

Sources

  1. Russell, S. & Norvig, P. (2010). Artificial Intelligence: A Modern Approach (Third Edition). Prentice Hall, NJ
  2. Poole, D. and Mackworth, A. (2010). Artificial Intelligence: Foundations of Computational Agents, Cambridge University Press. Freely available on the Web (http://artint.info/index.html)
  3. Luger, G. (2009) Artificial Intelligence: Structures and Strategies for Complex Problem Solving, 6th Edition. Addison-Wesley
  4. Hastie, T., Tibshirani, R., and Friedman, J. (2009) Elements of Statistical Learning: Data mining, inference and prediction, Second Edition, retrieved from http://www-stat.stanford.edu/~tibs/ElemStatLearn/
  5. Langley, P. (1996) Elements of Machine Learning, Morgan Kaufmann Publishers
  6. Mitchell, T. (1997) Machine Learning, McGraw-Hill
  7. Ng, A. (2011). Online machine learning videos, retrieved from http://www.ml-class.org/course/video/preview_list
  8. This lab is based on the Species Distribution Modeling assignment developed by Park Williams (UCSB Geography). It was revised and enhanced for an AI audience by Eric Eaton (Bryn Mawr College).
  9. Phillips, S. J., Dudík, M., and Schapire, R. E. (2004). A maximum entropy approach to species distribution modeling. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04). ACM. doi: 10.1145/1015330.1015412.
  10. Phillips, S. J., Anderson, R. P., and Schapire, R. E.. (2006). Maximum entropy modeling of species geographic distributions. Ecological Modelling 190 (3–4): 231-259 doi: 10.1016/j.ecolmodel.2005.03.026.
  11. Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E. and Yates, C. J. (2011). A statistical explanation of MaxEnt for ecologists. Diversity and Distributions 17: 43–57. doi: 10.1111/j.1472-4642.2010.00725.x
  12. Phillips, S. J., Anderson, R. P., and Schapire, R. E. (2006). Maximum entropy modeling of species geographic distributions. Ecological Modelling 190 (3–4): 231-259 doi: 10.1016/j.ecolmodel.2005.03.026.
  13. http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve
  14. Fisher, D. (2001). Editorial: Special issue on unsupervised learning. Machine Learning, 42, 5-7
  15. Fisher, D. (1987). Knowledge Acquisition Via Incremental Conceptual Clustering, Machine Learning, 2:139-172. Reprinted in J. Shavlik & T. Dietterich (eds.), Readings in Machine Learning, 267--283, Morgan Kaufmann, 1990.
  16. Fisher, D. (2002). "Conceptual Clustering," in W. Klosgen and J. Zytkow (eds.), Handbook of Data Mining and Knowledge Discovery, Oxford University Press, 388--396, Chapter 16.5.2. Preprint retrieved from http://www.vuse.vanderbilt.edu/~dfisher/KDD-Handbook/clustering.pdf
  17. Caruana, R. (1997). Multitask learning: A knowledge-based source of inductive bias. Machine Learning, 28:41--75.
  18. Blei, D. 2012. Communications of the ACM, vol. 55, no. 4. Retrieved from http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf