WhyWhere 2.0 update

I suddenly received a number of emails about [tag]WhyWhere[/tag], so I thought I would answer them all here with an update on the progress of the new version. This is my highest priority now, and a beta should be available in a week or so. The old version was too hard to maintain, having been built up by a number of students and postdocs over many years. The new version will be in [tag]R[/tag] and so will have far fewer lines of code. It will also be more consistent with subscription trends: it will consist of a small block of HTML code that you cut and paste into your web page, which will then (hopefully!) generate a dynamic display of the prediction of the best model so far as it mines through the database for correlations. Some of the questions I have been asked are answered below.

I was wondering if the dataset used with WhyWhere includes attributes of river variables. I am speaking about channel width, reach gradient, mean annual flow, etc. I am interested in modeling potential species distributions in a riverine network. Any insight would be valuable…

The HYDRO1k data set is included. A list of variables is at http://edc.usgs.gov/products/elevation/gtopo30/hydro/namerica.html

Our [tag]Biodiversity[/tag] section is a recent initiative and we want to use DesktopGARP. It seems very useful for our research work. We are having problems running it and understanding the results. We were hoping there might be a detailed manual that could give us a better understanding of this software.

All questions regarding DesktopGARP should go to the list set up for that purpose at http://www.cria.org.br/mailman/listinfo/desktopgarp.

I would like to use WhyWhere for predicting coral species distributions. Since I’m not familiar with the software and whether the model can meet my needs, I would like to try the web version first. However, I cannot open the web page for the web version of WhyWhere. Has the link been removed? Where will I be able to get access to the WhyWhere web version? Or must I download the Windows version?

I have unlinked the web server version of WhyWhere as I don’t intend to support it long-term, and will soon have a new server version. The desktop version can be downloaded and works fine. It does, however, require a few packages to be installed for use by the Perl modules, which people have found a bit of a hurdle.

Do you have a time estimate for when the WhyWhere version 2.0 portal and the FAQ and ToDo List documents will be available? As part of a class project and of my master’s work, I am investigating the various models available. I am interested in trying the WhyWhere model for the class project, but want to make sure I’ll have all of the necessary information in time to complete it. If not everything will be available, I will likely do my class project with Desktop GARP instead.

My best estimate is a week or two. Sorry I can’t be more precise. I have had to move all my stuff onto a new server, but this will be a permanent home from now on. Version 2.0 should be much better and easier to use.

I would like to know if the GARP model is suitable for use on smaller areas (roughly 150 000 ha) and, if so, whether the pre-processed dataset of North America would be usable in this case or whether we would have to create our own. Also, I’ve noticed that the section in the user’s manual on climate change is under construction, and I was wondering if there has been any progress, or anywhere else I might be able to learn a bit more about the model’s ability to function using climate change scenarios. Lastly, if possible, would you be able to let me know what the 0 values of the parameters in the dataset layers for North America represent?

If you are interested in DesktopGARP, go to the list above. To work at that scale you would need vegetation data from satellite. I have a few layers, such as the continuous fields dataset from the University of Maryland, that will be available in WhyWhere. Other than that, you would need to get your own. Climate change predictions are possible (by saving a model, then reapplying it to shifted variables). However, I have no shifted variables available in the WhyWhere database.
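
To make the “save a model, then reapply it to shifted variables” idea concrete, here is a minimal sketch in R (the language of the new version). The logistic niche model, the variable names and the two-degree shift are all invented for illustration; this is not the actual WhyWhere code.

```r
# Sketch only: calibrate a niche model on current conditions, then reapply
# the saved model to a shifted version of the same variable.
set.seed(1)
temp_now <- runif(500, 0, 30)                    # current temperature at sites
presence <- rbinom(500, 1, dnorm(temp_now, 18, 4) / dnorm(18, 18, 4))

m <- glm(presence ~ temp_now + I(temp_now^2), family = binomial)  # 'saved' model

p_now     <- predict(m, type = "response")       # prediction for current climate
p_shifted <- predict(m, newdata = data.frame(temp_now = temp_now + 2),
                     type = "response")          # prediction for a +2C scenario
```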

AIG Article

The Australian Institute of Geoscientists News has published online my article “Reconstruction of past climate using series with red noise” on page 14. Many thanks to the editor, Louis Hissink, for the rapidity of this publication. It is actually a very interesting newsletter, with articles on the IPCC, a summary of the state of the hockey stick (or hokey stick), and articles on the K-T boundary controversy and how to set up an exploration company.

Reconstructing the hokey stick with random data neatly illustrates the circular reasoning in a general context, showing that the form of the hokey stick is essentially encoded in the assumptions and procedures of the methodology. The fact that 20% of LTP series (or 40% if you count the inverted ones) correlate significantly with the instrumental temperature record of the last 150 years illustrates (1) that 150 years is an inadequate constraint on possible models from which to extrapolate over 1000 years, and (2) the propensity of natural series with LTP to exhibit ‘trendiness’, or apparent long runs that can be mistaken for real trends. And check back shortly for the code; I have been playing around with RE and R2 and trying some ideas suggested by blog readers to tighten things up.
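
While the code is being tidied up, the flavour of the ‘trendiness’ result can be had from a minimal R sketch. Here AR(1) red noise stands in for the LTP series, and a trending target stands in for the instrumental record; all parameters are illustrative, not the actual analysis.

```r
# Count how many random red-noise series 'significantly' correlate with a
# trending record, under the usual (independence) significance test.
set.seed(42)
n      <- 150                                            # years of record
target <- seq(0, 1, length.out = n) + rnorm(n, sd = 0.2) # trending 'instrumental' series

n_sig <- 0
for (i in 1:1000) {
  x <- as.numeric(arima.sim(list(ar = 0.9), n = n))      # red-noise series
  if (cor.test(x, target)$p.value < 0.05) n_sig <- n_sig + 1
}
n_sig / 1000   # fraction passing the test despite being pure noise
```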

With the hokey stick discredited from all angles, even within the paleo community itself (the recent reconstructions of Esper and Moberg show large variations in temperature over the last 1000 years, including temperatures on a par with the present day), one wonders why it is taking so long for the authors of the hokey stick to recant and admit natural climate variability. While the past variability of climate may or may not be important to the attribution debate, it is obviously important on the impacts side, as an indicator of the potential tolerances of most species.

Hurst, Joseph, colours and noises

Demetris Koutsoyiannis contributed the following excellent piece as a comment on a previous post. I have made it into a post to ensure it gets the widest distribution.

Hurst, Joseph, colours and noises: The importance of names in an important natural behaviour

“What’s in a name? That which we call a rose
By any other name would smell as sweet.”

William Shakespeare, “Romeo and Juliet”, Act 2, Scene 2

Is the name given to a physical phenomenon or a scientific concept (e.g. a mathematical object) really unimportant? Let us start with a characteristic example, the term “regression”. The term was coined by Francis Galton, who studied biological data and noticed that the offspring population was closer to the overall mean size than the parent population. For example, sons of unusually short fathers have heights typically closer to the mean height than their fathers. Today we know that this does not manifest a peculiar biological phenomenon but a normal and global statistical behaviour. The slope of the least squares straight line of two variables x and y is r_xy * s_y / s_x, where s_x and s_y are the standard deviations of the variables and r_xy is the correlation coefficient. In the example of the heights of fathers and sons, s_x = s_y, so the slope is precisely r_xy, which (by definition) is not greater than one; hence the “regression” towards the mean. Today no one has any problem with this generally accepted term, even though clearly it is not a good name. No one has any problem understanding the statistical (rather than biological or physical) origin of the “regression”, or the fact that it has nothing to do with time: for example, the fathers of exceptionally short people also tend to be closer to the mean than their sons. Just interchange y and x (and the axes in the graph) and you will have another line whose slope (in the new graph) will again be r_xy, that is, not greater than unity. However, until people understood these simple truths, the improper term must have caused several fallacies (see Regression fallacies in the Wikipedia article “Regression toward the mean”, http://en.wikipedia.org/wiki/Regression_toward_the_mean).
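
The symmetry is easy to verify numerically. A short R sketch (the heights and the correlation of 0.5 are invented for illustration) shows that with equal standard deviations the regression slope equals r_xy in both directions:

```r
# Regression to the mean has nothing to do with time's direction: with equal
# standard deviations, the slope equals the correlation coefficient whichever
# variable is treated as the predictor.
set.seed(1)
n       <- 10000
r       <- 0.5
fathers <- rnorm(n, mean = 178, sd = 7)
sons    <- 178 + r * (fathers - 178) + rnorm(n, sd = 7 * sqrt(1 - r^2))

coef(lm(sons ~ fathers))["fathers"]  # about 0.5
coef(lm(fathers ~ sons))["sons"]     # also about 0.5
cor(fathers, sons)                   # the common slope, r_xy
```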

Continue reading Hurst, Joseph, colours and noises

How to start a science blog (nice version)

You might have noticed the change in the URL for this site to http://www.landshape.org/enm. I have had to set up a site with a web hosting company and move the blog over, as the old server couldn’t cope with the traffic. Here are some of my thoughts on blogs for others who might be interested in starting their own.

There are many reasons a scientist might start a blog:

  • Prepublishing work to enable review by others
  • Reaching out to the general community
  • Disseminating research notes
  • Reviewing the literature
  • Advocating a position or idea
  • Facilitating project management
  • Making money

Of these, the last is probably the trickiest, but I will say something about that too. After deciding to start a blog, the next question is how to do it. There is a range of possibilities available. Following are my notes on the experience.

Continue reading How to start a science blog (nice version)

Blogs on random temperature reconstruction

A new temperature reconstruction has certainly resonated with many people. Here is a summary of what some of the blogs have been saying, and my corrections of some small inaccuracies.

American Thinker wrote a very upbeat but over-the-top piece.

The scientific argument that humans have caused global warming – a major underpinning of the “Kyoto Protocols” – suffered a major blow last week, with the publication of a new study. The implications have not yet spread very far beyond the rarified circles of specialists, but the gospel of “anthropogenic” – human-caused – global warming has lost one of its intellectual foundations.

However, the work has not yet been through the rigor of peer-reviewed publication, though some preliminary results will be in the Australian Institute of Geoscientists newsletter next month.

Continue reading Blogs on random temperature reconstruction

Cross validation as a test of random reconstructions

To recap previous posts (http://www.climateaudit.org/?p=566): I replicated the cross-validation procedure used in MBH98 to test the reconstruction skill of randomly generated series against raw and filtered CRU temperatures. The RE statistic correctly indicated no skill for the reconstruction on both the raw and the filtered temperature data. The R2 statistic indicated no skill on the raw temperature data but skill at predicting the filtered temperature data. The importance of these ‘tests’ is that they are the basis for accepting or rejecting a reconstruction. The question addressed is: are tests using RE and R2 capable of discriminating between meaningful proxy data and a reconstruction developed from random data?
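
For readers wanting to replicate this, the two statistics are easily computed. Below is a minimal R sketch of the definitions as I am using them (the variable names and toy numbers are mine, not from MBH98):

```r
# RE compares squared errors of the reconstruction against a 'no knowledge'
# baseline: the calibration-period mean. R2 is the squared correlation.
re_stat <- function(obs, pred, calib_mean) {
  1 - sum((obs - pred)^2) / sum((obs - calib_mean)^2)
}
r2_stat <- function(obs, pred) cor(obs, pred)^2

# Toy usage: a worthless 'reconstruction' over a 45-year verification period
obs  <- rnorm(45)                      # e.g. observed 1856-1900 temperatures
pred <- rnorm(45)                      # random 'reconstruction'
re_stat(obs, pred, calib_mean = 0.1)   # near or below zero: no skill
r2_stat(obs, pred)                     # near zero: no skill
```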

Continue reading Cross validation as a test of random reconstructions

RE of random reconstructions

To follow up on the last post, I have calculated the RE as well as the R2 statistics for the reconstruction from the random series. The same approach was used, i.e. generate 1000 sequences with LTP, select those with positive slope and R2 > 0.1, calibrate with a linear model, and average. Here is the reconstruction again, with the test and training periods marked with a horizontal dashed line (test period to the left, training period to the right of the temperature values):

Continue reading RE of random reconstructions

R2 statistics for random reconstructions

As a follow-up on the previous post, I have examined the correlation statistics for the reconstruction of past climate from random series with red noise. I have tried to use the same approach as MBH98, where the model is tested on data for years held back from the main analysis and model development. Different intervals of years could be chosen, but in the case of MBH98, the model is trained on the years 1901-1990 and tested on the years 1856-1900. The distribution of R2 values is as follows:

Figure 1. The frequency distribution of R2 values for all series (trees) over the training interval in blue, and the test interval in red. The distribution of R2 before selection is shown by the solid line and after selection by the dashed line. Series are selected if the R2 value is greater than 0.1 and the slope is positive.

Continue reading R2 statistics for random reconstructions

A new temperature reconstruction

In honor of the National Research Council of the National Academies committee to study “Surface Temperature Reconstructions for the Past 1,000-2,000 Years”, meeting at this moment, I offer my own climate reconstruction based on the methods blessed by dendroclimatology. The graph below shows reconstructed temperature anomalies over 2000 years, with the surface temperature measurements from 1850 from CRU as black dots, the individual series in blue and the climate reconstruction in black. I think you can see the similarity to other published reconstructions (see here), particularly the prominent ‘hockey-stick’ shape, the cooler temperatures around the 1500s and the Medieval Warm Period around the 1000s. What data did I use? Completely random sequences. Reconstruction methods from dendroclimatology will generate plausible climate reconstructions even on random numbers!
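
The recipe fits in a few lines of R. The sketch below is a condensed illustration of the same screen-calibrate-average procedure described in the follow-up posts; AR(1) red noise stands in for the LTP series, and all parameters are illustrative:

```r
# Generate random 'proxies', screen for positive slope and R2 > 0.1 against
# a trending calibration record, calibrate each by linear regression, average.
set.seed(7)
instr <- seq(0, 1, length.out = 150) + rnorm(150, sd = 0.2)  # last 150 'years'
calib <- 1851:2000                                           # calibration years

recon <- rep(0, 2000); kept <- 0
for (i in 1:1000) {
  proxy <- as.numeric(arima.sim(list(ar = 0.95), n = 2000))
  fit   <- lm(instr ~ proxy[calib])
  if (coef(fit)[2] > 0 && summary(fit)$r.squared > 0.1) {    # screening step
    recon <- recon + coef(fit)[1] + coef(fit)[2] * proxy     # calibrated series
    kept  <- kept + 1
  }
}
recon <- recon / kept    # average of screened series: a 'hockey stick' from noise
```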

Continue reading A new temperature reconstruction

WhyWhere published in Ecological Modelling

The paper on WhyWhere entitled “Improving ecological niche models by data mining large environmental datasets for surrogate models” by David R.B. Stockwell, Ecological Modelling 192 (2006) 188–196 is finally available here. Note the source for the application is temporarily available here, due to a bad file on the main site.

The WhyWhere algorithm (and accompanying database of environmental data) developed from concern with two issues:

  • Is it justified to eliminate the large number of possible correlates of species distributions?
  • Is it justified not to consider a range of possible distributions of species responses to those variables?

It is easy to find justifications for the converse questions, i.e. why certain variables such as annual averages of temperature and rainfall are included in models, and why certain distributions such as a bell-shaped curve might be preferred. But is there any real reason for excluding the alternatives, particularly when their value can be determined objectively by maximum accuracy or some other measure of utility?

In questioning these assumptions with the WhyWhere algorithm, it was found that many variables not usually included in ENMs, such as monthly averages of climate, do indeed maximize accuracy. At other times, surprising variables such as the density of beef cattle provide the best model (for the Yellow Star Thistle). In addition, examples have been found of accuracy-maximizing variables whose responses are not bell-shaped, such as asymptotic (the Brown Tree Snake) and bimodal (house price increases) distributions. The conclusion is that the preference for annual averages and unimodal distributions in most ENM studies is questionable, and may produce models of species distributions with less than maximum accuracy.

Scale invariance for Dummies

Below is an investigation of scale invariance, or long term persistence (LTP), in time series including tree-ring proxies – its recognition, quantification and implications for analysis – drawn largely from Koutsoyiannis [2] (preprints available here). In researching this topic, I found a lot of misconceptions about LTP phenomena, such as that LTP implies a long term memory process, and a lack of recognition of its implications. As to implications, the standard error of the mean of global temperatures at 30 data points is 4 times larger than the usual estimate for normal errors. Given that LTP is a fact of nature – attributed by Koutsoyiannis to the maximum entropy (ME) principle – this strongly suggests that H should be considered in all hypothesis testing.
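
The factor of 4 follows from the scaling of the standard error under LTP, which goes as n^(H-1) rather than n^(-1/2). A two-line check in R, taking H = 0.88 as an illustrative Hurst coefficient for temperature-like series:

```r
# Ratio of the LTP standard error of the mean to the classical iid estimate.
H <- 0.88; n <- 30
(1 / n^(1 - H)) / (1 / sqrt(n))   # n^(H - 0.5): about 3.6, i.e. roughly 4
```
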
Continue reading Scale invariance for Dummies

Predicting spatial patterns of house prices

Predicting real estate is like any other geospatial problem – all you need is data (e.g. see Zillow). If locations such as cities and their house prices or price increases are correlated with environmental variables, then a model can be developed. Here I address the question: what environmental variables predict the increase in house prices in metro areas of the US? Applying WhyWhere (WW) to a topic other than predicting species produces some interesting results: the best predictor of areas with high or low price increases is precipitation (not temperature, as is usually thought), and the response shows a strongly bimodal distribution. This example illustrates the generality of the WW approach to environmental niche modeling (ENM).

WW can search almost 1000 available environmental data images, such as the precipitation variable above, create a model based on the best one to three variables, and display the predicted probabilities on maps colored according to the adjacent legend.

Continue reading Predicting spatial patterns of house prices

Peer-censorship and scientific fraud

The major scientific journals are often regarded as the touchstones of scientific truth. However, their reputation has been tarnished by yet another major scientific fraud, unfolding over South Korean researcher Hwang Woo-suk’s peer-reviewed and published stem cell research. Should the publication of these results be viewed as simple ‘mistakes’, a crime by a deviant individual, or a broader conspiracy aided by lax reviewing and journal oversight? Blogs were apparently instrumental in uncovering the inconsistencies in Hwang’s publications. Here I look at peer-censorship in the environmental sciences and its role in concealing scientific waste and fraud, and uncover the emerging solutions from pre-print archives and blogs.

Continue reading Peer-censorship and scientific fraud

Presence absence or presence only?

There are two main forms of data about species occurrences: lists of locations where a species has been found, called presence-only (P) data, and lists of locations where a species has been recorded as present or absent (PA). In developing ENMs, PA data are often said to be preferable to P data (e.g. Austin and Meyers 1996), and some have shown empirical results supporting this view (e.g. Brotons et al. 2004). But is there an intrinsic advantage to PA data?

Continue reading Presence absence or presence only?

Sigmoids in the mist

Have become interested in checking out dendroclimatology from the ENM point of view – particularly evaluating the model used for the functional responses of alpine trees to temperature. All studies in Briffa et al. 2001 (figure below) invariably use a linear model, an OLS fit of the proxy to temperature, be it tree ring width (TRW) or density (MXD). It is of course not possible for tree growth to increase indefinitely with temperature – it has to be limited. The obvious choice for a more accurate model of tree response is a sigmoidal curve. Analysis follows…

Plate 3 from Briffa et al. [JGR 2001]. Original Caption: Plate 3. Comparison of six large-scale reconstructions, all recalibrated with linear regression against the 1881-1960 mean April-September observed temperature averaged over land areas north of 20N. All series have been smoothed with a 50-year Gaussian-weighted filter and are anomalies from 1961-1990 mean. Observed temperature for 1871-1997 (black) from Jones et al. (1999);
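
As a preview of the kind of analysis intended, here is a small R sketch fitting both a straight line and a sigmoid to synthetic proxy-versus-temperature data (the saturation point, noise level and curve parameters are invented):

```r
# Compare a linear fit with a saturating (sigmoidal) fit of proxy growth
# against temperature, on synthetic data built to saturate.
set.seed(3)
temp <- runif(200, 5, 25)
trw  <- 1 / (1 + exp(-(temp - 15) / 2)) + rnorm(200, sd = 0.05)

linear  <- lm(trw ~ temp)
sigmoid <- nls(trw ~ 1 / (1 + exp(-(temp - t0) / k)),
               start = list(t0 = 12, k = 1))
AIC(linear, sigmoid)   # the sigmoid should win where growth saturates
```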

Continue reading Sigmoids in the mist

Archive of environmental data using Storage Resource Broker

by David Stockwell and Bing Zhu for SRB Workshop, February 2-3, 2006, San Diego, due Dec 15th

Here we describe the use of the Storage Resource Broker (SRB) to support new data-intensive approaches to Environmental Niche Modeling (ENM) by providing access to cropped images from a remote SRB data store of almost 1000 global coverage data sets. The basic architecture of the system is illustrated in the figure below.

Figure 1. Illustration of the components and operation of the SRB WhyWhere data archive for ecological niche modeling. A large set of images and metadata are stored in a central archive. The client directs the server to crop an image in the archive using a server-side proxy operation. The cropped image is copied to the local directory and scaled by the client to the resolution required for the prediction algorithm. Illustrated is a prediction of a North American bird, the Cerulean Warbler.
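
In outline, the client-side step amounts to cropping a grid to a window and resampling it to the analysis resolution. A schematic R sketch of just that step (the server-side SRB proxy call is not reproduced; the grid, window and resolutions are invented):

```r
# Crop a global grid to a bounding box, then rescale to the analysis
# resolution by nearest-neighbour sampling of rows and columns.
global <- matrix(runif(180 * 360), nrow = 180, ncol = 360)   # 1-degree grid

rescale_nn <- function(g, nr, nc) {
  g[round(seq(1, nrow(g), length.out = nr)),
    round(seq(1, ncol(g), length.out = nc))]
}

region <- global[30:80, 200:300]         # crop: a hypothetical study window
grid   <- rescale_nn(region, 100, 200)   # resample to the required resolution
```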

Continue reading Archive of environmental data using Storage Resource Broker

Predicting spatial pattern of global warming with WhyWhere

The WhyWhere system has integrated a lot of environmental data sets of many different kinds with a robust method. This allows you to search for correlates of any set of geographic points, not just species. The user does not have to prepare these data, just enter the coordinates. I thought it would be interesting to see what correlated with recent temperature anomalies. We all know average annual temperatures have increased in the last 30 years, but the spatial pattern of those increases is less well understood.

Just as a test, I downloaded the PC version of WhyWhere onto a new machine to see what problems I might encounter, and recorded the results. Here are the steps and results…

Continue reading Predicting spatial pattern of global warming with WhyWhere

Advantages of WhyWhere

Some have been asking for an explanation of WhyWhere and how it fits in relation to other methods, particularly GARP. Although the details are in the paper, they are in a more academic form, so I thought I would try to explain it here.


Here is a nice schematic prepared by Jean Tate describing the basic one-dimensional model output from a run on the Yellow Star Thistle, illustrated as a frequency histogram. A 2D model would be similar, just with columns in two environmental dimensions.
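
The histogram model itself is simple enough to sketch. The following R snippet (with synthetic data; not the WhyWhere implementation) shows the idea of a one-dimensional model: compare the frequency of presence sites with the background frequency in each bin of an environmental variable:

```r
# One-dimensional histogram model: relative suitability is the ratio of the
# presence frequency to the background frequency in each environmental bin.
set.seed(5)
env_background <- runif(10000, 0, 40)            # variable over the whole region
env_presence   <- rnorm(300, mean = 25, sd = 4)  # variable at occurrence sites
env_presence   <- pmin(pmax(env_presence, 0), 40)

breaks <- seq(0, 40, by = 2)
h_bg <- hist(env_background, breaks = breaks, plot = FALSE)$counts
h_pr <- hist(env_presence,   breaks = breaks, plot = FALSE)$counts

ratio <- (h_pr / sum(h_pr)) / (h_bg / sum(h_bg)) # > 1 means favoured conditions
```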

Continue reading Advantages of WhyWhere

Preprint of Lifemapper

I have just placed the Lifemapper paper onto the arXiv pre-print archive here.

The use of the GARP genetic algorithm and internet grid computing in the Lifemapper world atlas of species biodiversity

Authors: David R.B. Stockwell, James H. Beach, Aimee Stewart, Gregory Vorontsov, David Vieglais, Ricardo Scachetti Pereira
Comments: 17 pages, 4 figures, in press at Ecological Modelling
Subj-class: Quantitative Methods; Other

Lifemapper (this http URL) is a predictive electronic atlas of the Earth’s biological diversity. Using a screensaver version of the GARP genetic algorithm for modeling species distributions, Lifemapper harnesses vast computing resources through volunteers’ PCs, similar to SETI@home, to develop models of the distribution of the world’s fauna and flora. The Lifemapper project’s primary goal is to provide an up-to-date and comprehensive database of species maps and prediction models (i.e. a fauna and flora of the world) using available data on species’ locations. The models are developed using specimen data from distributed museum collections and an archive of geospatial environmental correlates. A central server maintains a dynamic archive of species maps and models for research, outreach to the general community, and feedback to museum data providers. This paper is a case study in the role, use and justification of a genetic algorithm in the development of large-scale environmental informatics infrastructure.

Preprint of WhyWhere

I have just posted the WhyWhere paper to arXiv here.

Improving ecological niche models by data mining large environmental datasets for surrogate models

Authors: David R.B. Stockwell
Comments: 16 pages, 4 figures, to appear in Ecological Modelling
Subj-class: Quantitative Methods

WhyWhere is a new ecological niche modeling (ENM) algorithm for mapping and explaining the distribution of species. The algorithm uses image processing methods to efficiently sift through large amounts of data to find the few variables that best predict species occurrence. The purpose of this paper is to describe and justify the main parameterizations and to show preliminary success at rapidly providing accurate, scalable, and simple ENMs. Preliminary results for 6 species of plants and animals in different regions indicate a significant (p<0.01) 14% increase in accuracy over the GARP algorithm using models with few, typically two, variables. The increase is attributed to access to additional data, particularly monthly vs. annual climate averages. WhyWhere is also 6 times faster than GARP on large data sets. A data mining based approach with transparent access to remote data archives is a new paradigm for ENM, particularly suited to finding correlates in large databases of fine resolution surfaces. Software for WhyWhere is freely available, both as a service and in a desktop downloadable form from the web site this http URL

GARP and numbers of data points

There are a number of issues that arise in the analysis of spatial data points, insufficient data and spatial autocorrelation being two of the most often raised.

As a general principle, the external accuracy (on the test set) will increase asymptotically as the number of data points increases, and the internal accuracy (on the training set) will decrease asymptotically as the number of data points increases. We are most often interested in external accuracy, so more data is better, but the returns are diminishing. There are also plenty of results showing that the shape of the curve varies a lot between methods and species, but it’s not clear why.

Spatial autocorrelation occurs when the occurrences are sufficiently close together that they essentially duplicate each other. The main problem is that it leads to spuriously high indications of significance, as the tests generally assume the points are independent. This could create problems, for example, if you were to develop a logistic regression using a stepwise method and the variables were at different scales, or resolutions; I would then expect the finer-scale variables to produce misleadingly high correlations. What GARP does is map all the environmental layers and the occurrences onto a grid of the same extent and resolution. This results in a reduction in the number of data points, with more being eliminated at coarser resolutions. I believe this gives a more authentic representation of the information contained in the presence points and reduces, though does not eliminate, spatial autocorrelation. This is completely built into the system, though it is somewhat independent of the actual GARP algorithm. It’s just one of those things I added to try to achieve a more general purpose system.
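
For concreteness, the gridding step can be sketched in a few lines of R (the coordinates and resolution are invented; this is the idea, not the GARP code):

```r
# Snap occurrence coordinates to grid-cell centres and drop duplicates, so
# several nearby points count only once at the analysis resolution.
occurrences <- data.frame(lon = c(-100.01, -100.02, -98.7, -100.03),
                          lat = c(  40.01,   40.02,  41.3,   40.04))
res <- 0.5                                   # grid resolution in degrees

cells <- unique(data.frame(lon = round(occurrences$lon / res) * res,
                           lat = round(occurrences$lat / res) * res))
nrow(occurrences)   # 4 raw points
nrow(cells)         # 2 grid cells: the three clustered points collapse to one
```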