First Time At Niche Modeling?
This is a blog on the power of numeracy. My first book, Niche Modeling, is now in print. The first six chapters are tutorial topics in R programming and theoretical topics in niche modeling: functions, data, spatial, topology, environmental data collections, and examples. The last six chapters are about using niche modeling to detect errors: bias, autocorrelation, non-linearity, long term persistence, circularity and fraud. This is useful information for any biological modeler.
April 27, 2006
In Praise of Numeracy
Mathematical shapes can affect our lives and the decisions we make.
The hockey stick graph describing the Earth’s average temperature over the last millennium has been the subject of a heated debate over the reliability of the statistical methods behind it.
The long tail is another new icon. It was developed in the blogosphere and is described by Chris Anderson in a new book called “The Long Tail”:
Forget squeezing millions from a few megahits at the top of the charts. The future of entertainment is in the millions of niche markets at the shallow end of the bit stream. Chris Anderson explains all in a book called “The Long Tail”. Follow his continuing coverage of the subject on The Long Tail blog.
As explained in Wikipedia:
The long tail is the colloquial name for a long-known feature of statistical distributions (Zipf, Power laws, Pareto distributions and/or general Lévy distributions). The feature is also known as “heavy tails”, “power-law tails” or “Pareto tails”. Such distributions resemble the accompanying graph. In these distributions a low frequency or low-amplitude population that gradually “tails off” follows a high frequency or high-amplitude population. In many cases the infrequent or low-amplitude events (the long tail, represented here by the yellow portion of the graph) can cumulatively outnumber or outweigh the initial portion of the graph, such that in aggregate they comprise the majority.
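The claim that the tail can outweigh the head is easy to check numerically. The following is a minimal sketch, assuming a pure Zipf distribution with exponent 1 over 100,000 items and a "head" of the top 100 ranks. These numbers and the function names are illustrative choices, not taken from Anderson or Wikipedia.

```python
def zipf_weights(n_items, exponent=1.0):
    """Return normalized Zipf weights for ranks 1..n_items."""
    raw = [1.0 / rank ** exponent for rank in range(1, n_items + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def tail_share(weights, head_size):
    """Fraction of the total carried by items ranked below the head."""
    return sum(weights[head_size:])


# Assumed catalogue size and head size, chosen only for illustration.
weights = zipf_weights(n_items=100_000, exponent=1.0)
share = tail_share(weights, head_size=100)
print(f"Share held by items ranked below the top 100: {share:.2f}")
```

Under these assumptions the items outside the top 100 together carry more than half of the total. A steeper exponent or a smaller catalogue shifts the balance back toward the head.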
April 26, 2006
Novel methods continue to improve prediction of species’ distributions.
The recently published paper by Jane Elith, Catherine Graham et al., “Novel methods improve prediction of species’ distributions from occurrence data” (EG06), is sure to be a landmark study in the field. EG06 compares 16 modeling methods using 226 well-surveyed species in 6 regions of the world. Measures of statistical skill on held-back data show a spread across a wide range of methods: from older methods such as BIOCLIM and DOMAIN, through GARP, GLM and GAM, to the newer machine learning arrivals MAXENT and BRT and the community-based method GDM. This prompts the conclusion that “novel methods improve prediction”. The work of a great many people is appreciated, as these results will no doubt be very helpful to many biodiversity modellers in the future.
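To make "statistical skill on held-back data" concrete, here is a minimal sketch of one widely used skill measure, the area under the ROC curve (AUC), computed from model scores at sites that were not used for fitting. The scores below are invented for illustration, and this is not a reproduction of the EG06 evaluation code.

```python
def auc(presence_scores, absence_scores):
    """Probability that a random presence site outscores a random absence site.

    Ties count as half a win, which matches the rank-based definition of AUC.
    """
    wins = 0.0
    for p in presence_scores:
        for a in absence_scores:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(presence_scores) * len(absence_scores))


# Hypothetical model scores at held-back sites (not real data).
held_out_presence = [0.9, 0.8, 0.6, 0.4]
held_out_absence = [0.7, 0.3, 0.2, 0.1]

skill = auc(held_out_presence, held_out_absence)
print(f"AUC on held-back data: {skill:.3f}")
```

An AUC of 0.5 means the model ranks sites no better than chance, and 1.0 means every presence site outscores every absence site. Comparing methods then amounts to comparing such scores over the same held-back sites.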
Why novel methods work
EG06 attributes the success of the newer methods to their ability to represent the complexity of the relationships between species and their environment:
One feature that they all share in common is a high level of flexibility in fitting complex responses.
The same thing was found in the early 80’s when novel machine learning methods, particularly neural nets, decision trees (CART), and genetic algorithms (GARP), were first used for species distribution modeling. When BIOCLIM and GLM were the only species distribution methods, these early experiments showed that heuristic approaches from machine learning would benefit the field.
The opposing view, still widely held, is that approaches should be based strictly on ecological theory, such as using BIOCLIM to represent the Hutchinsonian niche. This view is valid too, if your primary aim is to evaluate the theory and not necessarily to maximize accuracy. This is a familiar theme in ML: real world performance requires heuristic complexity at the expense of theoretical elegance. Happily, the species prediction problem has come to the attention of the leading edge researchers in present day ML, and both theory and practice will benefit from the interplay.
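The theory-driven end of this spectrum can be sketched in a few lines. The following is a deliberately simplified envelope model in the spirit of BIOCLIM: a site is predicted suitable if every environmental variable falls within the range observed at known occurrences. The plain min/max box, the two variables and the numbers are assumptions made for illustration, not the actual BIOCLIM implementation.

```python
def fit_envelope(occurrences):
    """Return a (min, max) pair per environmental variable.

    Each occurrence record is a tuple of environmental values at a site
    where the species was observed.
    """
    n_vars = len(occurrences[0])
    return [
        (min(rec[i] for rec in occurrences), max(rec[i] for rec in occurrences))
        for i in range(n_vars)
    ]


def inside(envelope, site):
    """True if every variable at the site lies within the envelope."""
    return all(lo <= value <= hi for (lo, hi), value in zip(envelope, site))


# Hypothetical records: (mean annual temperature in C, annual rainfall in mm).
occurrences = [(12.0, 800.0), (15.0, 950.0), (18.0, 700.0)]
envelope = fit_envelope(occurrences)

print(inside(envelope, (14.0, 900.0)))  # within both ranges
print(inside(envelope, (20.0, 900.0)))  # too warm, so outside the box
```

The box has no way to bend around interactions between variables, which is the kind of flexibility the newer methods add.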
Historic progression of models
In addition to the complexity dimension, the range of statistical skill across methods also represents a historical progression. GARP in the early 80’s used genetic algorithms to combine the major methods of the time, BIOCLIM, GLM and surrogate, into a multi-model rule-set. The strategy of using multiple models for prediction is also used in the highest performing method in EG06, boosted regression trees (BRT). Under ideal conditions the ensemble approach in GARP would be expected to be better than the worst of the methods it uses (BIOCLIM), but no better than the best of them (GLM), and this is shown in EG06. It is likely that other high-performing approaches such as BRT have evolved from experience with earlier regression tree algorithms such as classification and regression trees (CART).
Unanswered issues
The major unanswered issue in species distribution modeling is environmental data selection. In EG06, selection of environmental data reflects typical practice:
The environmental data used for each region were determined according to their relevance to the species being modeled (Austin 2002) as determined by the data provider (Tables 1 and 3).
EG06 does n



