The recently published paper by Jane Elith, Catherine Graham et al., “Novel methods improve prediction of species’ distributions from occurrence data” (EG06), is sure to be a landmark study in the field. EG06 compares 16 modeling methods using 226 well-surveyed species in 6 regions of the world. Measures of statistical skill on held-back data show a spread across a wide range of methods, from older methods such as BIOCLIM and DOMAIN, through GARP, GLM and GAM, to the newer arrivals from machine learning (MAXENT, BRT) and the community-based method GDM, prompting the conclusion that “novel methods improve prediction”. The work of a great many people is appreciated, as these results will no doubt be very helpful to many biodiversity modellers in the future.
Why novel methods work
EG06 attributes the success of the newer methods to representing complexity of the relationships of species to their environment.
One feature that they all share in common is a high level of flexibility in fitting complex responses
The same thing was found in the early 80’s when novel machine learning methods — particularly neural nets, decision trees (CART), and genetic algorithms (GARP) — were first used for species distribution modeling. When BIOCLIM and GLM were the only species distribution methods, these early experiments showed heuristic approaches from machine learning would benefit the field.
The opposing view still widely held is that approaches should be based strictly on ecological theory, such as using BIOCLIM to represent the Hutchinsonian niche. This view is valid too, if your primary aim is to evaluate the theory and not necessarily maximize accuracy. This is a familiar theme in ML — that real world performance requires heuristic complexity at the expense of theoretical elegance. Happily the species prediction problem has come to the attention of the leading edge researchers in present day ML, and both theory and practice will benefit from the interplay.
Historic progression of models
In addition to the complexity dimension, the range of statistical skill across methods also represents an historic progression. GARP in the early 80’s used genetic algorithms to combine the major methods of the time (BIOCLIM, GLM and surrogate) into a multi-model rule-set. The strategy of using multiple models for prediction is also used in the highest performing method in EG06, BRT (Boosted Regression Trees). Under ideal conditions the ensemble approach in GARP would be expected to be better than the worst of the methods it uses (BIOCLIM), but no better than the best of the methods (GLM), and this is shown in EG06. It is likely that other high-performing approaches such as BRT have evolved from experience with earlier regression tree algorithms such as classification and regression trees (CART).
Unanswered issues
The major unanswered issue in species distribution modeling is environmental data selection. In EG06, selection of environmental data reflects the typical practices:
The environmental data used for each region were determined according to their relevance to the species being modeled (Austin 2002) as determined by the data provider (Tables 1 and 3).
EG06 does not address the problem of environmental data selection. No more than about 13 environmental data sets were used in each region. No statistics were provided for the power of these datasets. This in no way undermines their conclusions. However, people want to develop the best models possible. It has been shown in “Improving ecological niche models by data mining large environmental datasets for surrogate models” (S06) that monthly climate variables may be much more effective than annual climate averages, suggesting the variables typically used are not the best possible variables.
Just as GARP arose out of concern with arbitrariness in the use of functional forms and generalized over them, WhyWhere arises out of concern with arbitrariness in the selection of environmental data. This problem has only become apparent as the number of data sets available has burgeoned. There is also the large dataset problem of modeling species distribution in the marine environment, where the depth parameter multiplies the number of possible variables enormously (e.g. nutrient levels at each depth).
Where WhyWhere fits in
One of the findings in “Effects of sample size on accuracy of species distribution models” (SP04), a major comparison of 1060 species in Mexico using logistic regression (LR), GARP and a simple surrogate model (SS), was that the old SS method performed surprisingly well. This was interesting, as a very simple approach can be the basis for a very high performance algorithm, enabling the analysis of large data sets. Rather than use a more theoretically precise approach to clustering the environment, such as k-means, which I have found to be inefficient and unreliable, I used a quicker, more reliable heuristic method for classifying colors in images to develop a practical approach to data mining on the order of 1000s of datasets, called WhyWhere.
It was subsequently shown in S06 that a relatively simple, low-dimensional SS model searching a very large set of data could outperform a more complex model using a small set of general environmental variables. This is because some specific variables, particularly monthly climate variables, correlate well with most species, but the best variables differ from species to species (see Surprising finding #3 for recent results). The opposing view is that variable selection should follow ecological theory, and that a small set of climate variables represents ecological determinants adequately. This is rhetorically similar to the arguments for ecologically-based models, and it is valid only if you are willing to sacrifice maximum accuracy: real-world performance comes at the expense of theoretical elegance.
There is also a larger agenda. Just as GARP and other algorithms demonstrated the value of machine learning approaches in the 80’s, WhyWhere is promoting biodiversity modeling by demonstrating the value of data mining approaches. This strategy enables the best minds in computer science to engage productively with the biodiversity field, and promotes biodiversity modeling, still a minor player compared with climate and population modeling.
Role of WhyWhere
WhyWhere can be used in a ‘pre-modeling’ stage. Points can be run through the server here just to see which variables give greatest accuracy. After objectively determining the best variables from the currently 528 terrestrial variables available, you can include them in other approaches if required.
The advantages of using WhyWhere in a pre-modelling stage are:
- Greater objectivity in environmental variable selection
- Applicability to environments with a large number of variables (e.g. marine environments with a depth dimension)
- Generality to applications other than species distribution (e.g. house prices and climate)
- Potentially more accurate models using optimal datasets
- No need for each person to develop a new set of variables
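To make the pre-modeling idea concrete, here is a minimal sketch of screening candidate variables one at a time. It is not the WhyWhere code: it assumes each variable has already been sampled at the presence points and at a set of background points, it uses simple equal-width bins as a stand-in for the color-classification heuristic, and it scores on the same points it fits, so the accuracies are optimistic. The names `surrogate_accuracy`, `rank_variables`, `jul_temp` and `noise_layer` are hypothetical.

```python
import numpy as np


def surrogate_accuracy(presence_vals, background_vals, n_bins=16):
    """Balanced accuracy of a one-variable categorical surrogate model.

    The variable is cut into n_bins equal-width categories (an assumed,
    simplified stand-in for a color-quantization heuristic). A category is
    predicted 'present' when presences fall in it at a higher rate than
    background points do.
    """
    presence_vals = np.asarray(presence_vals, dtype=float)
    background_vals = np.asarray(background_vals, dtype=float)
    lo = min(presence_vals.min(), background_vals.min())
    hi = max(presence_vals.max(), background_vals.max())
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]  # interior bin edges only
    p_bins = np.digitize(presence_vals, edges)     # category index 0..n_bins-1
    b_bins = np.digitize(background_vals, edges)
    p_freq = np.bincount(p_bins, minlength=n_bins) / len(presence_vals)
    b_freq = np.bincount(b_bins, minlength=n_bins) / len(background_vals)
    predict_present = p_freq > b_freq
    sensitivity = predict_present[p_bins].mean()     # presences predicted present
    specificity = (~predict_present[b_bins]).mean()  # background predicted absent
    return 0.5 * (sensitivity + specificity)


def rank_variables(presence, background, n_bins=16):
    """Return (variable name, accuracy) pairs, best variable first."""
    scores = {
        name: surrogate_accuracy(presence[name], background[name], n_bins)
        for name in presence
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Synthetic demo: one informative layer and one pure-noise layer (made-up data).
rng = np.random.default_rng(0)
presence = {
    "jul_temp": rng.normal(20.0, 2.0, 200),
    "noise_layer": rng.uniform(0.0, 40.0, 200),
}
background = {
    "jul_temp": rng.uniform(0.0, 40.0, 2000),
    "noise_layer": rng.uniform(0.0, 40.0, 2000),
}
for name, score in rank_variables(presence, background):
    print(f"{name}: {score:.2f}")
```

The top-ranked variables from a screen like this could then be passed to any of the other methods. Held-back test points would be needed before reading the scores as real skill.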
Research questions:
1. On average it appears the Worldclim monthly climate average datasets are most frequently the highest performers (see Surprising finding #3 for recent results). While BIOCLIM may have performed relatively poorly in EG06, these preliminary results suggest the datasets largely associated with it may be very powerful. Will combining monthly climate and other variables with the novel methods increase skill?
2. Can the novel methods be run on 1000 environmental variables, many of which are categorical with 100s of categories each? For example regarding BRT:
Therefore, it is not prudent to analyze categorical dependent variables (class variables) with more than, approximately, 100 or so classes.
The approach of using low-dimensional models with a piece-wise fit seems necessary for achieving reliably high performance on large numbers of environmental data sets.
3. How should we best address the problem of burgeoning numbers of environmental correlates? This is the next nettle that must be grasped to continue to move the field forward.
4. Are the data and algorithms in EG06 freely available for benchmarking other models?
Conclusions
EG06 provides an objective and comprehensive evaluation of the statistical skill of a wide range of methods at predicting the distribution of well-surveyed species from presence-only data using a small number of generic data sets, identifying the best methods for future studies. The study did not address the important role of environmental data set selection. Results and logic suggest that a wider range of environmental data than is currently used, such as the monthly climate averages, will improve accuracy even more. People should start using them.
References
EG06 - Jane Elith, Catherine H. Graham, Robert P. Anderson, Miroslav Dudík, Simon Ferrier, Antoine Guisan, Robert J. Hijmans, Falk Huettmann, John R. Leathwick, Anthony Lehmann, Jin Li, Lucia G. Lohmann, Bette A. Loiselle, Glenn Manion, Craig Moritz, Miguel Nakamura, Yoshinori Nakazawa, Jacob McC. M. Overton, A. Townsend Peterson, Steven J. Phillips, Karen Richardson, Ricardo Scachetti-Pereira, Robert E. Schapire, Jorge Soberón, Stephen Williams, Mary S. Wisz and Niklaus E. Zimmermann, 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29: 129-.
S06 - Stockwell D.R.B. 2006. Improving ecological niche models by data mining large environmental datasets for surrogate models. Ecological Modelling 192: 188–196.
SP04 - Stockwell D.R.B., Peterson A.T. 2002. Effects of sample size on accuracy of species distribution models. Ecological Modelling 148 (1): 1-13.
Worldclim - Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis, 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978.

5 responses so far ↓
David, thanks for the thoughtful review of this important data. I think your comment about the environmental datasets is quite important. I work in the marine environment, with museum-type presence data that comes from a broad range of locations and depths. Climatologies of marine parameters often have a couple dozen depth levels, and the size and number of datasets quickly gets quite large. Some of the methods in the paper were more able than others to handle large numbers of environmental data layers, and your suggestion about using WhyWhere to weed out environmental parameters to a manageable number seems like a good one.
Thanks Karen. Modeling has always consisted of a stage where relevant variables were identified, then models developed and validated using those variables. There are still alternative methods to use at each stage, and whether one method can handle both stages remains an open question. I think it is important to remember there are two stages and not do only one or the other. Regards
3 Niche Modeling » Phillips et al. Maxent // Jun 28, 2006 at 1:43 pm
[...] The paper by S.J. Phillips, R.P. Anderson and R.E. Schapire, A maximum entropy approach to species distribution modeling, introduces for the first time the Maximum Entropy approach, well known in machine learning. They also provide the Maxent software for predicting species distributions, and evaluate it against a well-known method called GARP in predicting the distribution of two Neotropical mammals, a sloth Bradypus variegatus and a rodent Microryzomys minutus.
The Maxent principle is to estimate the probability distribution, such as the spatial distribution of a species, that is most spread out subject to constraints such as the known observations of the species. The principle uses entropy as the means of generalizing from specific observations of presence of a species, and does not require or even incorporate absence points within the theoretical framework. Presence-only points are observations of the presence of a species. For a variety of reasons, absence of a species is not usually recorded. It would seem to be an advantage to have a framework that is based on presence-only points, the most common form of data in niche modeling. Usually we have to shoe-horn methods developed for discriminating presence from absence points into working with presence-only points. The method does seem to perform well and will no doubt be increasingly used.
There are some aspects of the way Maxent characterizes the problem of niche modeling that will seem surprising and make it a little difficult to grasp. Rather than working directly with environmental variables like temperature and precipitation as predictors, variables are transformed into feature vectors, which may be the mean of variables, their square, their product with other variables, thresholds, or binarizations of categorical variables. While claiming Maxent is most similar to generalized linear models (GLM) in approach, Phillips et al. state it is a ‘generative’ approach rather than a ‘discriminative’ method like GLM. This means that where Y is the response (probability of occurrence) and X are the inputs (feature vectors), it models Pr(X|Y), the probability of the inputs given the response. A discriminative approach models Pr(Y|X), the probability of a response given the inputs, which is what is required for prediction. Phillips et al. state that Maxent uses Bayes’ rule to get from Pr(X|Y) to Pr(Y|X), and in fact uses Pr(X|Y=1), as only occurrence points are used. This is one of the many things I would have liked to have seen explained more, but I couldn’t find it in the paper.
Another big difference is that Maxent estimates a probability across every pixel in the study area, which means the probabilities of all pixels sum to 1. Rather than individual probabilities of 0.8, 0.9, etc. representing the suitability of each pixel for a species, the probabilities are each very small. To get around this, Maxent assigns a “cumulative” probability to each pixel, which is the sum of the probabilities of all pixels with lesser probability. The interpretation is that a distribution predicted using cumulative probability threshold t is expected to omit approximately a fraction t of test locations while occupying the minimum predicted area. For example, a threshold of 5% will omit about 5% of occurrences while minimizing area, which is what we would want to represent a type of boundary on the distribution.
Much is made in the paper of a subjective visual comparison of maps produced by Maxent and GARP, with the claim that Maxent predictions appeared to have more fine detail. It is not clear why. While they try variations on the 10 ‘best subsets’ approach, they don’t seem to check whether the different ways of representing probability account for the apparent lack of definition.
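The cumulative representation described above can be sketched in a few lines. This is only an illustration of the idea as I read it, not the Maxent software: the function names are hypothetical, each pixel's own probability is included in its cumulative value, and ties are broken arbitrarily by sort order.

```python
import numpy as np


def cumulative_output(raw):
    """Cumulative value per pixel: total probability of all pixels whose
    raw probability is no greater than this pixel's (ties broken by order)."""
    raw = np.asarray(raw, dtype=float)
    raw = raw / raw.sum()                  # make sure the pixels sum to 1
    order = np.argsort(raw, kind="stable")  # pixels from least to most probable
    cum = np.empty_like(raw)
    cum[order] = np.cumsum(raw[order])     # running total assigned back to pixels
    return cum


def predict_presence(cum, t):
    """Predict presence where the cumulative value exceeds threshold t.

    The excluded pixels together hold at most probability t, which is why
    the expected omission of test locations is roughly t.
    """
    return np.asarray(cum) > t


# Four made-up pixels whose raw probabilities already sum to 1.
raw = [0.4, 0.1, 0.3, 0.2]
cum = cumulative_output(raw)
print(cum)
print(predict_presence(cum, 0.25))
```

The raw values are tiny when there are many pixels, while the cumulative values always span the range from near 0 to 1, which makes them easier to threshold and to map.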
Reference is also made to GARP testing the limit of storage, but exactly what information the program DesktopGARP writes to produce 20GB rather than the 285MB of Maxent I don’t know. The GARP algorithm shouldn’t need a lot of temporary storage, so it’s probably a detail of implementation that could be easily changed, and not inherent in the algorithm. Similarly, figures of 2hrs for 100 runs of GARP compared with 2.3hrs for one run of Maxent are quoted, so be prepared for long runs if using Maxent for repeated analyses.
One of the experiments examines the effect of adding the categorical variable vegetation type, which is claimed to improve accuracy in Maxent but to have little effect in GARP. A new implementation of the GARP algorithm in OpenModeller is supposed to handle categorical variables more efficiently, so this is another result that is probably specific to an implementation.
Overfitting is reduced with a regularization principle equivalent to a Gibbs distribution with a constraint, the minimization of which encourages Maxent to focus on the most important feature vectors. A Gibbs distribution has the form q(x) = exp(w · f(x)) / Z, where f(x) are the feature vectors, w are their weights, and Z = Sum(exp(w · f(x))) over all pixels is the normalizing constant, so that the probabilities sum to 1. Gibbs distributions are usually used to define the equilibrium probabilities of stationary microscopic states. That the most accurate model on independent data should be constrained by a Gibbs distribution is an interesting aside, but the relevance to niche modelling is not made clear. Phillips et al. throw in references to the theory of convex duality, Gibbs distributions, and Bayes’ rule, but you would need a lot of background in machine learning to understand fully how the typical problems in niche modeling, such as bias, small samples, non-linear distributions and autocorrelation, would affect the method.
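For readers who, like me, are not up on statistical mechanics, the Gibbs form is easier to see numerically. The sketch below only evaluates q(x) = exp(w · f(x)) / Z for given weights. It does not fit the weights or apply any regularization, and the feature values and weights are made up for illustration.

```python
import numpy as np


def gibbs_distribution(features, weights):
    """Probability of each pixel under q(x) = exp(w . f(x)) / Z.

    features: (n_pixels, n_features) array of feature values f(x)
    weights:  (n_features,) array of feature weights w
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    scores = features @ weights
    scores = scores - scores.max()   # shift for numerical stability; q is unchanged
    unnormalized = np.exp(scores)
    return unnormalized / unnormalized.sum()  # dividing by Z makes q sum to 1


# Made-up example: 5 pixels, features are a scaled temperature and its square.
temp = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
features = np.column_stack([temp, temp ** 2])
q = gibbs_distribution(features, weights=[8.0, -8.0])
print(q, q.sum())
```

With all weights at zero every pixel gets the same probability, which is the most spread out (maximum entropy) distribution. Non-zero weights pull probability toward pixels whose features match the observations.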
I would personally have liked to have seen a lot more explanation of the methodology, not being up on statistical mechanics or machine learning, and less of the evaluation, much of which consisted of subjective attempts to interpret the results in terms of species biology, and of large and largely irrelevant tables (the ROC graphs would have sufficed). Far more extensive trials reviewed here have shown the method performs well. As I said here, “novel methods from Machine Learning continue to improve prediction”. The insight of the theoreticians, and their input into niche modeling, is greatly appreciated, as these results will no doubt help to propel niche modeling in the future. [...]
4 Niche Modeling » Comparison of Predictive Models of Species Distributions // Jul 3, 2006 at 1:33 pm
[...] Authors: Can Ozan Tan, Uygar Ozesmi, Meryem Beklioglu, Esra Per, Bahtiyar Kurt
Comments: Submitted to Ecological Informatics
This interesting study on arXiv (in review) compares some predictive models for species distribution not examined in the study of EG+06, reviewed in Novel methods continue to improve prediction of species’ distributions. They used nearest neighbor (k-NN, ARTMAP) and neural net methods (not evaluated in EG+06), and generalized linear models and discriminant analysis (LDA and QDA) (evaluated in EG+06). The GLM method is common to both TO+06 and EG+06. They found:
The methods considered k-NN, LDA, QDA, generalized linear models (GLM) feedforward multilayer backpropagation networks and pseudo-supervised network ARTMAP. For ecosystems involving time-dependent dynamics and periodicities whose frequency are possibly less than the time scale of the data considered, GLM and connectionist neural network models appear to be most suitable and robust, provided that a predictive variable reflecting these time-dependent dynamics included in the model either implicitly or explicitly. For spatial data, which does not include any time-dependence comparable to the time scale covered by the data, on the other hand, neighborhood based methods such as k-NN and ARTMAP proved to be more robust than other methods considered in this study.
Both of the nearest neighbor methods performed better than GLM methods on bird breeding data sets. To the extent results are comparable, this would place them above GLM in the EG+06 study and possibly among the best performing techniques. The traditional neural net methods performed worse than GLM. The good results for neighbourhood based methods are encouraging. My favourite method at the moment, WhyWhere, uses a categorization heuristic related to nearest neighbour approaches.
I got into this method when simple categorization approaches gave superior performance over GLMs in Effects of sample size on accuracy of species distribution models. [...]
5 Niche Modeling » IPCC Fraud Solutions // Jun 29, 2008 at 5:48 am
[...]
1. The review must be systematic. The type of evidence to be accessed is explicitly stated by the commissioning agency and the procedures adhered to, with a view to minimizing personal bias.
2. The review must be without conflict of interest. It must be done by people with nothing to gain from the promotion of specific studies.
3. The review must pay particular attention to the relative quality of evidence contributed by each study.
In respect to the third point, a number of different systems have been set up. Adaptation of one of the most well known, the Oxford system, leads to a breakdown something like the following, highest level first:
Level 1. Blind and randomized studies with publicly archived data and code.
Level 2. Comprehensive and independently tested studies with publicly archived data and code.
Level 3. Observational evidence and correlation studies with publicly archived data.
Level 4. Theory, models, and case studies.
Level 5. Expert opinion.
The application of such a system to climate science is not to enforce it, but to use it to evaluate existing evidence in focussed systematic reviews. This would not be expensive, or require an overhaul or massive retraining of climate scientists as Michael Tobin fears. It would provide a positive incentive for studies to be structured in ways that have been proven to yield more reliable results, with less personal bias. It would help to build a climate of trust. For example, a review commission might stipulate that all evidence is to be level 3 and above, requiring at least publicly archived data. This would eliminate a great many of the studies cited in the IPCC review. It would however provide a strong incentive for archiving data the next time around. A blind trial of climate models need not be any more expensive than comparison trials as they are currently conducted. As an example, the accuracy of a range of niche modeling methods was evaluated in a blind trial reported here.
It would be recognized that the IPCC is just another review, and an unstructured and biased one at that. Its main in-scope goal is to find a human influence on climate, and the range of other reasons for climate change is out of scope. This creates a systematic bias against natural explanations for climate change. This objective is clearly stated in its award of a Nobel Peace Prize “for their efforts to build up and disseminate greater knowledge about man-made climate change, and to lay the foundations for the measures that are needed to counteract such change”. So my solution is not one among the many I have seen that seem to tend to one extreme or the other, from cries of “fraud” to blind acceptance of the IPCC as gospel truth. The solution is just to keep it in perspective, and for those who are financially impacted by the implications to conduct their own structured reviews of key components of the case, and let these be a guide to their policy decisions. [...]