To recap: previous posts (http://www.climateaudit.org/?p=566) replicated the cross-validation procedure used in MBH98, testing the apparent skill of reconstructions built from randomly generated series against raw and filtered CRU temperatures. The RE statistic correctly indicated no skill for the reconstruction on both the raw and filtered temperature data. The R2 statistic indicated no skill on the raw temperature data, but apparent skill at predicting the filtered temperature data. These ‘tests’ matter because they are the basis for accepting or rejecting a reconstruction. The question addressed is whether tests using RE and R2 can discriminate between a reconstruction based on meaningful proxy data and one developed from random data.

I thought I would try another type of cross-validation procedure. This time the test data (validation) are selected at random from the years in the temperature series in the same proportion as the previous test, and those selected data are deleted from the temperatures in the training set (calibration). I also ran fewer simulations as it was more time consuming.
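A random hold-out of this kind can be sketched in a few lines of Python. This is only an illustration: the function name, the year range and the 50% test fraction are placeholders of mine, since the actual proportion is whatever the previous test used.

```python
import random

def random_holdout(years, test_fraction, seed=None):
    """Pick test (validation) years at random; the rest form the training set."""
    rng = random.Random(seed)
    years = list(years)
    n_test = round(len(years) * test_fraction)
    test = set(rng.sample(years, n_test))
    # Years chosen for the test set are deleted from the training set.
    train = [y for y in years if y not in test]
    return train, sorted(test)

# Hypothetical instrumental span and fraction, for illustration only.
all_years = list(range(1850, 2000))
train_years, test_years = random_holdout(all_years, test_fraction=0.5, seed=1)
print(len(train_years), len(test_years))
```

Unlike a split into an early and a late block, the test years here are scattered through the whole record, so training and test sets cover the same range of temperatures.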

1. Ran the following steps 20 times
2. Generated 100 random sequences of length 2000 (years)
3. Selected those sequences with positive slope and R2 > 0.1
4. Calibrated each selected sequence using the inverse of the linear model fit to that sequence
5. Smoothed the sequences using a 50 year Gaussian filter
6. Averaged the sequences
7. Calculated the R2 and RE statistics against raw and filtered temperature data on the training and test sets.
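For readers who want to experiment, here is a rough Python sketch of one pass through the steps above. It is not the code behind the results in this post, and several details are my own assumptions: the random sequences are taken to be random walks, the linear model regresses each sequence on temperature over the training years, the "50 year" filter is read as a Gaussian with a standard deviation of 50 years, the last years of each sequence are taken to overlap the instrumental record, RE is measured against the training-period mean, and the temperature series is a synthetic stand-in rather than CRU data.

```python
import numpy as np

def gaussian_smooth(x, sigma):
    """Gaussian-weighted moving average, renormalised near the ends."""
    half = int(4 * sigma)
    kernel = np.exp(-0.5 * (np.arange(-half, half + 1) / sigma) ** 2)
    num = np.convolve(x, kernel)[half:half + len(x)]
    den = np.convolve(np.ones(len(x)), kernel)[half:half + len(x)]
    return num / den

def r2(obs, pred):
    """Squared Pearson correlation."""
    return np.corrcoef(obs, pred)[0, 1] ** 2

def re_stat(obs, pred, cal_mean):
    """Reduction of error relative to the calibration (training) mean."""
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - cal_mean) ** 2)

def one_run(temp, rng, n_seq=100, length=2000, test_frac=0.5, sigma=50.0):
    n = len(temp)
    # Random hold-out: test years are removed from the training set.
    test = np.zeros(n, dtype=bool)
    test[rng.choice(n, size=int(test_frac * n), replace=False)] = True
    train = ~test

    # Step 2: random sequences (assumed here to be random walks).
    walks = np.cumsum(rng.standard_normal((n_seq, length)), axis=1)
    kept = []
    for w in walks:
        recent = w[-n:]  # assumed overlap with the instrumental years
        slope, intercept = np.polyfit(temp[train], recent[train], 1)
        # Step 3: keep sequences with positive slope and R2 > 0.1.
        if slope > 0 and r2(temp[train], recent[train]) > 0.1:
            # Step 4: invert the linear model; step 5: smooth.
            kept.append(gaussian_smooth((w - intercept) / slope, sigma))
    if not kept:
        return 0, {}

    recon = np.mean(kept, axis=0)[-n:]  # step 6: average
    targets = {"CRU": temp, "CRUgs": gaussian_smooth(temp, sigma)}
    stats = {}
    for name, target in targets.items():  # step 7: R2 and RE
        cal_mean = target[train].mean()
        for label, mask in (("training", train), ("test", test)):
            stats[(label, name)] = {
                "R2": r2(target[mask], recon[mask]),
                "RE": re_stat(target[mask], recon[mask], cal_mean),
            }
    return len(kept), stats

rng = np.random.default_rng(0)
# Synthetic stand-in for the instrumental record, NOT CRU data.
temp = 0.005 * np.arange(150) + 0.1 * rng.standard_normal(150)
n_selected, stats = one_run(temp, rng)
print(n_selected, stats)
```

Calling one_run repeatedly and taking the mean and standard deviation of each statistic would correspond to step 1. The numbers it produces depend on the assumptions listed above and should not be expected to match the table.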

The results for R2 and RE are in the table below. The mean number of sequences selected was 16.95 out of 100. The first thing to notice is that both statistics appear to indicate skill for the reconstruction on the cross-validation test data (using the significance levels in MBH98 of RE>0, R2>0.2). This shows that tests on held-back data do not necessarily eliminate reconstructions generated from random data. Secondly, it is interesting to notice that the RE statistic is much more variable than the R2 statistic in its results, with a standard deviation larger than the mean in every case.

Case                           R2 (mean)   s.d.    RE (mean)   s.d.
Training period CRU~recon      0.50        ±0.07   0.17        ±0.34
Test period CRU~recon          0.51        ±0.09   0.22        ±0.34
Training period CRUgs~recon    0.86        ±0.06   0.25        ±0.37
Test period CRUgs~recon        0.87        ±0.09   0.26        ±0.35
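The comparison against the thresholds, and the relative spread of each statistic, can be checked directly from the table. The short Python snippet below does only that arithmetic; the variable names are mine, and the numbers are copied from the rows above.

```python
# (R2 mean, R2 s.d., RE mean, RE s.d.) copied from the table above.
rows = {
    "Training period CRU~recon":   (0.50, 0.07, 0.17, 0.34),
    "Test period CRU~recon":       (0.51, 0.09, 0.22, 0.34),
    "Training period CRUgs~recon": (0.86, 0.06, 0.25, 0.37),
    "Test period CRUgs~recon":     (0.87, 0.09, 0.26, 0.35),
}

R2_MIN, RE_MIN = 0.2, 0.0  # thresholds quoted in the text: R2>0.2, RE>0

summary = []
for case, (r2_mean, r2_sd, re_mean, re_sd) in rows.items():
    passes = r2_mean > R2_MIN and re_mean > RE_MIN
    # Ratio of standard deviation to mean, as a measure of relative spread.
    r2_ratio = r2_sd / r2_mean
    re_ratio = re_sd / re_mean
    summary.append((case, passes, r2_ratio, re_ratio))
    print(f"{case:28s} passes={passes}  "
          f"sd/mean: R2={r2_ratio:.2f}  RE={re_ratio:.2f}")
```

Every row clears both thresholds on its mean values. The ratio of standard deviation to mean stays below 0.2 for R2 and above 1 for RE, which is the difference in variability noted above.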

Conclusions

While this is only one approach to cross-validation on held-back data, it is apparent that such tests are not necessarily adequate to determine whether a reconstruction is based on random data. In the language of Popper, they do not constitute a severe test, in that significant values of the R2 and RE statistics are not unlikely for reconstructions generated at random. Using the same language, the general approach of selecting and calibrating series based on observed sensitivity to temperature does not allow us to distinguish a meaningful reconstruction from a random one. R2 and RE cross-validation statistics are not reliable rejectors of random series. Put another way, the probability of finding unusual 20th century warming with this approach is one, whatever data are used in the reconstruction. In other words, the argument is circular.

This does not look good for those studies that claim tree-ring proxies support the case for unusual 20th century warming.