I suddenly received a number of emails about WhyWhere, and I thought I would answer them all here with an update on progress of the new version. This is my highest priority now, and it should be available as a beta in a week or so. The old version was too hard to maintain, having been built on by a number of students and postdocs over many years. The new version will be in R and so will have far fewer lines of code. It will also be more consistent with subscription trends. It will consist of a small block of HTML code that you cut and paste into your web page. Then, it will (hopefully!) generate a dynamic output of the prediction of the best model so far as it mines through the database for correlations. Some of the questions I have been asked are below.
The Australian Institute of Geoscientists News has published online my article “Reconstruction of past climate using series with red noise” on page 14. Many thanks to Louis Hissink, the editor, for the rapidity of this publication. It is actually a very interesting newsletter with articles on the IPCC, and a summary of the state of the hockey stick (or hokey stick). There are articles on the K-T boundary controversy and how to set up an exploration company.
This week I am posting another quiz, although no-one has yet solved the Spaghetti Graph Quiz. This one, suggested by Demetris Koutsoyiannis, may require some statistical analysis to solve. I have plotted the points up, and converted them to an R statement below.
Demetris Koutsoyiannis contributed the following excellent piece as a comment on a previous post. I have made it into a post to ensure it gets the widest distribution.
Hurst, Joseph, colours and noises: The importance of names in an important natural behaviour
“What’s in a name? That which we call a rose
By any other name would smell as sweet.”
William Shakespeare, “Romeo and Juliet”, Act 2, Scene 2
Is the name given to a physical phenomenon or to a scientific concept (e.g. a mathematical object) really unimportant? Let us start with a characteristic example, the term “regression”. The term was coined by Francis Galton, who studied biological data and noticed that the offspring population was closer to the overall mean size than the parent population. For example, sons of unusually short fathers have heights typically closer to the mean height than their fathers. Today we know that this does not manifest a peculiar biological phenomenon but a normal and global statistical behaviour. The slope of the least squares straight line of two variables x and y is r_xy * s_y / s_x, where s_x and s_y are the standard deviations of the variables and r_xy is the correlation coefficient. In the example of the height of fathers and sons, s_x = s_y, so the slope is precisely r_xy, which (by definition) is not greater than one; hence the “regression” towards the mean. Today no one has any problem with this generally accepted term, even though clearly it is not a good name. No one has a problem understanding the statistical (rather than biological or physical) origin of the “regression” and that it has nothing to do with the direction of time: for example, the fathers of exceptionally short people also tend to be closer to the mean than their sons. Just interchange y and x (and the axes in the graph) and you will again have a line whose slope (in the new graph) is again r_xy, that is, not greater than unity. However, until people understood these simple truths, the improper term must have caused several fallacies (see Regression fallacies in the Wikipedia article “Regression toward the mean”, http://en.wikipedia.org/wiki/Regression_toward_the_mean).
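The symmetry of the slope argument can be checked numerically. The following is a minimal sketch in Python on simulated data; the correlation of 0.5, the sample size and the variable names are illustrative assumptions, not measured heights.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
r = 0.5  # assumed father-son correlation, for illustration only

# Standardised "heights" with equal spread (s_x = s_y) and correlation r.
fathers = rng.standard_normal(n)
sons = r * fathers + np.sqrt(1 - r**2) * rng.standard_normal(n)


def ols_slope(x, y):
    """Least squares slope of y on x, written as r_xy * s_y / s_x."""
    r_xy = np.corrcoef(x, y)[0, 1]
    return r_xy * y.std() / x.std()


# Regress sons on fathers, then swap the axes and regress fathers on sons.
slope_sons_on_fathers = ols_slope(fathers, sons)
slope_fathers_on_sons = ols_slope(sons, fathers)

print(slope_sons_on_fathers, slope_fathers_on_sons)
```

Both slopes come out close to the assumed correlation and below one, so the "regression" appears in either direction, whichever variable is placed on the horizontal axis.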
You might have noticed the change in the URL for this site to http://www.landshape.org/enm. I have had to set up a site on a web host and move the blog over, as the old server couldn’t cope with the traffic. Here are some of my thoughts on blogs for others who might be interested in starting their own.
There are many reasons a scientist might start a blog:
Prepublication of work to enable review by others
Outreach to the general community
Dissemination of research notes
Provide a review of the literature
Advocate a position or idea
Facilitate project management
Make money
Of these the last is probably the most tricky, but I will say something about that too. After deciding to start a blog, the next question is how to do it. There are a range of possibilities available. Following are my notes on the experience.
A new temperature reconstruction has certainly resonated with many people. Here is a summary of what some of the blogs have been saying, and my corrections of some small inaccuracies.
The scientific argument that humans have caused global warming – a major underpinning of the “Kyoto Protocols” – suffered a major blow last week, with the publication of a new study. The implications have not yet spread very far beyond the rarified circles of specialists, but the gospel of “anthropogenic” – human-caused – global warming has lost one of its intellectual foundations.
However, the article has not yet been through the rigor of publishing – but some preliminary results will be in the Australian Institute of Geoscientists newsletter next month.
Here is the ‘spaghetti graph’ of a number of prominent reconstructions, with two-sigma confidence interval. The CRU calibration temperatures are the solid black line. Can you find the random reconstruction? (Thanks to Steve McIntyre at http://www.climateaudit.org/?p=566 for recon data.)
Today I am reporting more results of reconstructing past climates with randomly generated sequences (http://www.climateaudit.org/?p=566). Here are a few experiments to identify the critical components of the dendroclimatology methodology. I record the skill of reconstruction with: different types of series (i.i.d., alternating means and fractional differencing), and dropping each component of the methodology in turn (positive slope, positive correlation, calibration with inverse linear model).
Random Series
Some alternatives for generating random series are: independent and identically distributed errors (called i.i.d.), and two ways of generating series with ‘red noise’ or long term persistence (LTP): alternating means and fractional differencing. An example of each series, with the CRU temperature data overlaid, is below.
Figure 1. Three random series generated to simulate CRU temperatures over 2000 years. The i.i.d. series has a standard deviation equal to that of the CRU temperatures. Parameters for altmeans were arbitrarily chosen, while parameters for fracdiff were calibrated using the R fracdiff package. Note that the i.i.d. series is the least realistic, altmeans is similar with some artifactual ‘jumps’, while fracdiff is very similar to temperatures.
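The three kinds of series can be sketched as follows. This is a Python illustration, not the R code used for the figure: the standard deviation, the alternating-means parameters and the differencing parameter d = 0.4 are assumed values chosen only for the example, and the fractional differencing is done with truncated moving-average weights rather than with the R fracdiff package.

```python
import numpy as np

rng = np.random.default_rng(42)
N_YEARS = 2000
SD = 0.25  # assumed standard deviation; a stand-in for the CRU value


def iid_series(n, sd, rng):
    """Independent, identically distributed normal errors."""
    return rng.normal(0.0, sd, n)


def altmeans_series(n, rng, level_sd=0.2, noise_sd=0.15, mean_run=50):
    """White noise around a level that jumps to a new random value
    after runs of random (geometric) length. Parameters are arbitrary."""
    levels = np.empty(n)
    i = 0
    while i < n:
        run = int(rng.geometric(1.0 / mean_run))
        levels[i:i + run] = rng.normal(0.0, level_sd)
        i += run
    return levels + rng.normal(0.0, noise_sd, n)


def fracdiff_series(n, sd, rng, d=0.4, burn=1000):
    """Fractionally differenced noise, ARFIMA(0, d, 0), built from the
    truncated weights psi_0 = 1, psi_k = psi_{k-1} * (k - 1 + d) / k."""
    m = n + burn
    k = np.arange(1, m)
    psi = np.concatenate(([1.0], np.cumprod((k - 1 + d) / k)))
    shocks = rng.standard_normal(m)
    x = np.convolve(shocks, psi)[burn:m]  # drop the burn-in values
    return sd * x / x.std()  # rescale to the requested spread


def lag1_autocorr(x):
    """Sample lag-one autocorrelation."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))


series = {
    "iid": iid_series(N_YEARS, SD, rng),
    "altmeans": altmeans_series(N_YEARS, rng),
    "fracdiff": fracdiff_series(N_YEARS, SD, rng),
}
for name, values in series.items():
    print(name, round(lag1_autocorr(values), 2))
```

The lag-one autocorrelation is printed only as a quick indication of persistence: the i.i.d. series should show essentially none, while the two red noise series should show a clearly positive value.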
The previous posts (http://www.climateaudit.org/?p=566) replicated the cross-validation procedure used in MBH98 to assess the reconstruction skill of randomly generated series on raw and filtered CRU temperatures. The RE statistic correctly indicated no skill for the reconstruction in both the raw and filtered temperature data. The R2 statistic indicated no skill on the raw temperature data and skill at predicting the filtered temperature data. The importance of these ‘tests’ is that they are the basis for accepting or rejecting a reconstruction. The question addressed is whether the tests using RE and R2 are capable of discriminating between meaningful proxy data and a reconstruction developed using random data.
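For readers unfamiliar with the two statistics, here is a minimal Python sketch of how they are commonly defined: RE compares the prediction error over the test period with the error of simply predicting the calibration-period mean, and R2 is the squared correlation between observed and predicted values. The function names and the toy numbers are illustrative assumptions, not values from the posts.

```python
import numpy as np


def re_statistic(obs_test, pred_test, calib_mean):
    """Reduction of error: 1 - SSE(prediction) / SSE(calibration mean)."""
    obs_test = np.asarray(obs_test, dtype=float)
    pred_test = np.asarray(pred_test, dtype=float)
    sse_pred = np.sum((obs_test - pred_test) ** 2)
    sse_ref = np.sum((obs_test - calib_mean) ** 2)
    return 1.0 - sse_pred / sse_ref


def r2_statistic(obs_test, pred_test):
    """Squared Pearson correlation between observed and predicted."""
    return float(np.corrcoef(obs_test, pred_test)[0, 1] ** 2)


# Toy test-period data (made up for the example).
obs = np.array([0.1, 0.2, 0.3, 0.4])
calib_mean = 0.0  # assumed mean of the calibration period

perfect = obs.copy()
offset = obs + 1.0  # right shape, wrong level

print(re_statistic(obs, perfect, calib_mean), r2_statistic(obs, perfect))
print(re_statistic(obs, offset, calib_mean), r2_statistic(obs, offset))
```

The second case shows why the two can disagree: a prediction with the right shape but the wrong level keeps a high R2 while its RE goes negative.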
To follow up on the last post, I have calculated the RE as well as the R2 statistics for the reconstruction from the random series. The same approach was used, i.e. generate 1000 sequences with LTP, select those with positive slope and R2>0.1, calibrate on a linear model, and average. Here is the reconstruction again, with the test and training periods marked with a horizontal dashed line (test period to the left, training to right of temperature values):
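The generate, select, calibrate and average steps can be sketched in Python as below. This is only an outline under stated assumptions: the "temperature" is a synthetic trend plus noise standing in for the CRU record, the red noise is fractionally differenced noise with an assumed d = 0.4, the training window is the last 90 values, and calibration is taken to be a simple linear fit of temperature on each series. None of these choices are confirmed details of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
N_YEARS, N_SERIES, N_TRAIN = 2000, 1000, 90


def red_noise(n, rng, d=0.4, burn=500):
    """Fractionally differenced noise via FFT convolution (assumed d)."""
    m = n + burn
    k = np.arange(1, m)
    psi = np.concatenate(([1.0], np.cumprod((k - 1 + d) / k)))
    shocks = rng.standard_normal(m)
    size = 2 * m  # zero-padding makes the circular convolution linear
    x = np.fft.irfft(np.fft.rfft(shocks, size) * np.fft.rfft(psi, size), size)
    return x[burn:m]


# Synthetic stand-in for the instrumental record over the training years.
temp_train = np.linspace(-0.1, 0.4, N_TRAIN) + rng.normal(0.0, 0.1, N_TRAIN)

selected = []
for _ in range(N_SERIES):
    s = red_noise(N_YEARS, rng)
    x = s[-N_TRAIN:]  # the part of the series overlapping the training years
    slope, intercept = np.polyfit(x, temp_train, 1)
    r2 = np.corrcoef(x, temp_train)[0, 1] ** 2
    if slope > 0 and r2 > 0.1:
        # Calibrate: express the whole series in temperature units.
        selected.append(intercept + slope * s)

# The "reconstruction" is the average of the calibrated, selected series.
reconstruction = np.mean(selected, axis=0)
print(len(selected), "of", N_SERIES, "series selected")
```

Because only series that happen to rise with the target over the training years survive the screening, their average follows the target there, while outside the training years the unrelated wiggles largely cancel out.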
As a follow-up on the previous post, I have examined the correlation statistics for the reconstruction of past climate from random series with red noise. I have tried to use the same approach as MBH98, where the model is tested over data for years held back from the main analysis and model development. Different intervals of years could be chosen, but in the case of MBH98, the model is trained on years 1901-1990 and tested on years 1856-1900. The distribution of R2 values is as follows:
Figure 1. The frequency distribution of R2 values for all series (trees) over the training interval in blue, and the test interval in red. The distribution of R2 before selection is shown by the solid line and after selection by the dashed line. A series is selected if its R2 value is greater than 0.1 and it has a positive slope.
In honor of the National Research Council of the National Academies committee to study “Surface Temperature Reconstructions for the Past 1,000-2,000 Years” meeting at this moment, I offer my own climate reconstruction based on the methods blessed by dendroclimatology. The graph below shows reconstructed temperature anomalies over 2000 years, with the surface temperature measurements from 1850 from CRU as black dots, the individual series in blue and the climate reconstruction in black. I think you can see the similarity to other published reconstructions (see here), particularly the prominent ‘hockey-stick’ shape, the cooler temperatures around the 1500s and the Medieval Warm Period around the 1000s. What data did I use? Completely random sequences. Reconstruction methods from dendroclimatology will generate plausible climate reconstructions even on random numbers!
The paper on WhyWhere entitled “Improving ecological niche models by data mining large environmental datasets for surrogate models” by David R.B. Stockwell, Ecological Modelling 192 (2006) 188–196 is finally available here. Note the source for the application is temporarily available here, due to a bad file on the main site.
Below is an investigation of scale invariance or long term persistence (LTP) in time series including tree-ring proxies – the recognition, quantification and implications for analysis – drawn largely from Koutsoyiannis [2] (preprints available here). In researching this topic, I found a lot of misconceptions about LTP phenomena, such as LTP implying a long term memory process, and a lack of recognition of the implications of LTP. As to implications, the standard error of the mean of global temperatures at 30 data points is 4 times larger than the usual estimate for normal errors. Given that LTP is a fact of nature – attributed by Koutsoyiannis to the maximum entropy (ME) principle – this strongly suggests that the Hurst coefficient H should be considered in all hypothesis testing.
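The size of the effect on the standard error can be illustrated with a short Python sketch. It assumes a simple scaling process, for which the standard deviation of the mean of n values is sigma / n^(1 - H) rather than the classical sigma / sqrt(n). The value H = 0.9 used here is an assumption for illustration, not a figure quoted above.

```python
import math


def se_mean_ltp(sigma, n, hurst):
    """Standard error of the mean of n values from a simple scaling
    process with Hurst coefficient `hurst` (H = 0.5 is the classical case)."""
    return sigma / n ** (1.0 - hurst)


def se_ratio(n, hurst):
    """How many times larger the LTP standard error is than sigma / sqrt(n)."""
    return n ** (hurst - 0.5)


n = 30
assumed_h = 0.9  # illustrative value only
ratio = se_ratio(n, assumed_h)
print(f"n = {n}, H = {assumed_h}: standard error is {ratio:.1f} times the classical estimate")
```

With H = 0.5 the ratio is exactly one, and it grows with both H and n, so the underestimate from ignoring persistence gets worse, not better, as more data points are averaged.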
Predicting real estate is like any other geospatial problem – all you need is data – e.g. see Zillow. If locations such as cities and their house prices or increases are correlated with environmental variables then a model can be developed. Here I address the question – what environmental variables predict the increase in house prices in metro areas of the US? Applying WhyWhere (WW) to a topic other than predicting species produces some interesting results: the best predictor of areas with high or low price increases is precipitation (contrary to temperature as is usually thought), and the response shows a strongly bimodal distribution. This example illustrates the generality of the WW approach to environmental niche modeling (ENM).
WW can search almost 1000 available environmental data images such as the precipitation variable above, create a model based on the best one to three and display the predicted probabilities on maps colored according to the adjacent legend.
The major scientific journals are often regarded as the touchstones of scientific truth. However, their reputation has been tarnished with yet another major scientific fraud unfolding over South Korean researcher Hwang Woo-suk’s peer-reviewed and published stem cell research. Should the publication of these results be viewed as simple ‘mistakes’, a crime by a deviant individual, or a broader conspiracy aided by lax reviewing and journal oversight? Blogs were apparently instrumental in uncovering the inconsistencies in Hwang’s publications. Here I look at peer-censorship in environmental sciences and its role in concealing scientific waste and fraud, and uncover the emerging solutions from pre-print archives and blogs.
There are two main forms of data about species occurrences: lists of locations where a species has been found, called presence-only (P) data, and lists of locations where a species has been recorded as either present or absent, called presence-absence (PA) data. In developing ENMs, PA data are often said to be preferable to P data (e.g. Austin and Meyers 1996), and some have shown empirical results supporting this view (e.g. Broto et al. 2004). But is there an intrinsic advantage to PA data?
I have become interested in checking out dendroclimatology from the ENM point of view – particularly evaluating the model used for functional responses of alpine trees to temperature. All studies in Briffa et al. 2001 (figure below) use a linear model, an OLS fit of the proxy to temperature, whether the proxy is tree ring width (TRW) or density (MXD). It is of course not possible for tree growth to increase indefinitely with temperature increases – it has to be limited. The obvious choice for a more accurate model of tree response is a sigmoidal curve. Analysis follows…
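To make the contrast concrete, here is a small Python sketch of a linear response next to a sigmoidal (logistic) one. All parameter values, and the function names, are hypothetical and chosen only to show the shapes. They are not fitted to any tree ring data.

```python
import numpy as np


def linear_response(temp, intercept=0.0, slope=0.1):
    """Growth increases without limit as temperature rises."""
    return intercept + slope * temp


def sigmoid_response(temp, lower=0.0, upper=2.0, midpoint=10.0, rate=0.5):
    """Logistic curve: growth levels off at `upper` (hypothetical values)."""
    return lower + (upper - lower) / (1.0 + np.exp(-rate * (temp - midpoint)))


temps = np.linspace(0.0, 40.0, 9)
for t, lin, sig in zip(temps, linear_response(temps), sigmoid_response(temps)):
    print(f"{t:5.1f}  linear={lin:5.2f}  sigmoid={sig:5.2f}")
```

Near the upper limit of the sigmoid, a large change in temperature produces almost no change in growth, which is the behaviour a straight line cannot represent.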
Plate 3 from Briffa et al. [JGR 2001]. Original Caption: Plate 3. Comparison of six large-scale reconstructions, all recalibrated with linear regression against the 1881-1960 mean April-September observed temperature averaged over land areas north of 20N. All series have been smoothed with a 50-year Gaussian-weighted filter and are anomalies from 1961-1990 mean. Observed temperature for 1871-1997 (black) from Jones et al. (1999);
by David Stockwell and Bing Zhu for SRB Workshop, February 2-3, 2006, San Diego, due Dec 15th
Here we describe the use of the Storage Resource Broker (SRB) to support new data intensive approaches to Environmental Niche Modeling (ENM) by providing access to cropped images from a remote SRB data store of almost 1000 global coverage data sets. The basic architecture of the system is illustrated in the figure below.
Figure 1. Illustration of the components and operation of the SRB WhyWhere data archive for ecological niche modeling. A large set of images and meta data are stored in a central archive. The client directs the server to crop an image in the archive using a server-side proxy operation. The cropped image is copied to the local directory and scaled by the client to the resolution required for the prediction algorithm. Illustrated is a prediction of a North American bird, the Cerulean Warbler.
The WhyWhere system has integrated a lot of environmental data sets of many different kinds with a robust method. This allows you to search for correlates of any geographic points, not just species. The user does not have to prepare the data sets, just enter the coordinates. I thought it would be interesting to see what correlated with recent temperature anomalies. We all know average annual temperatures have increased in the last 30 years, but the spatial pattern of those increases is less well understood.
Just as a test, I downloaded the PC version of WhyWhere onto a new machine to see what problems I might encounter, and recorded the results. Here are the steps and results…
Some have been asking for an explanation of WhyWhere and how it fits in relation to other methods, particularly GARP. Although the details are in the paper, they are in a more academic form and I thought I would try to explain it here.
Here is a nice schematic prepared by Jean Tate describing the basic one dimensional model output from a run on the Yellow Star Thistle, illustrated as a frequency histogram. A 2D model would be similar, just with columns over two environmental dimensions.