CSIRO Data Policy: Go Pound Sand

The Intellectual Property card was played today, so I cannot verify the statistical significance (or otherwise) of the Drought Exceptional Circumstances Report. But I found out enough about the statistical tests (performed after publication of the report, at my prompting) to determine that autocorrelation in the temperature series was probably not taken into account.

If this is the case, then it is highly likely the confidence intervals were grossly underestimated, and so it is also likely that only one or two regions (SWWA) show a statistically significant increase in predicted droughts, not the three or four claimed by the authors. I am now more confident in my original assessment that the results show no significant increase in drought due to greenhouse warming in almost all regions of Australia.

As an aside, it is normal practice, and in fact a requirement of publication in major scientific journals, that scientists document a good faith attempt to resolve points of contention prior to submitting a comment or request for correction. I think this is a good policy, as it avoids filling the literature with pointless disputes that could have been resolved between the disputants. Often the erring party issues a correction or withdraws the paper altogether (as Xian-Jin Li recently did with a faulty proof of the Riemann Hypothesis).

Drought Exceptional Circumstances funding is massive, so getting it right is very important. The client should be confident that the interpretation of the results is free of researcher bias. Payments to farmers involve billions of taxpayer dollars, and the government has a duty of diligence to ensure policy is based on statistically sound information. Then there is the reputation of CSIRO (Australia’s NASA equivalent) and the public interest at stake.

To date I am very happy with Kevin’s promptness, and I understand he may be constrained by CSIRO IP policy. If anyone has any specific information on this, please let me know. However, my suspicions were raised when the report only quoted increases in mean values and did not disclose whether these results were significant or not. Kevin’s initial inquiry into my concerns did not reveal a problem. I am quite happy to be proved wrong on this, but I still think there is a problem, and CSIRO data policy is not helping resolve the issue.

Below, I requested the data used in the previous tests and more details of the statistical methods used.

Dear Kevin,

Thank you for your explanation and summary of the results of your
significance tests. Sweeping other issues to the side, I would simply like to check the
significance of your results of increasing droughts in Australia. To do this I think
it would be sufficient to have:

1. The individual 13 values for areal % used to obtain each of the
mean and extreme values in tables 4, 7 and 9.
2. The data you used in the significance tests you quote below.
Delimited text files are best.
3. A description of the method you used to determine your significance.

I am assuming that the return period is a deterministic function of
areal % and so additional tests of significance will be redundant. If not, the respective data
for return period would also be of interest.

The results you quote below were interesting and I would like to
resolve any conflicting results that arise.
I note that your quoted significances reconcile with your claims that
“more declarations would be likely, and over larger areas, in the SW,
SWWA and Vic&Tas regions, with little
detectable change in the other regions.”

Many thanks in advance.

Dear David,

I’m not able to hand over the data from the 13 models, due to restrictions on Intellectual Property, but I can describe the methods used to determine statistical significance.

Dewi Kirono says:

· I have used a number of statistical tests (parametric and non-parametric) and found that most of them show agreement. I used the 5% significance level. One marginal case was the change in percentage area for exceptionally low rainfall in NSW, in which the t-test was insignificant at the 5% level while the Kolmogorov-Smirnov test was significant at the 5% level. I feel the non-parametric test is more objective since it doesn’t assume a Normal distribution.

· For the percentage area (temp, rain), the 13-model-mean sample is the 108 yr time series for 1900-2007 and the 31 yr time series for 2010-2040. For percentage area (soil moisture), the sample is the 50 yr time series for 1957-2006 and the same 50 yr time series modified for a period centred on 2030.

· For the frequency (temp and rain), the sample is the number of models (13) as each period (i.e. 1900-2007 and 2010-2040) only produces one return period value.

· For soil moisture frequency, I cannot perform the test as we only have one value for the obs (1957-2006).

· At the moment I’ve only applied the tests to the “mean” data, not the “90th” and “10th” percentiles. This is because we cannot do that for soil moisture and because we deal with lots of zero values for the 10th percentile.

Regards

Kevin Hennessy
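
To illustrate the kind of comparison Dewi describes, here is a minimal sketch using synthetic placeholder data with the stated sample sizes (108 and 31 annual values of areal %). This is my own illustration of a parametric and a non-parametric two-sample test applied to the same pair of samples, not CSIRO's code or data:

```python
# Illustrative only: synthetic stand-ins for annual "% area" values,
# not the CSIRO series. Compares a (Welch) t-test with a two-sample
# Kolmogorov-Smirnov test on the same pair of samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline = rng.gamma(shape=2.0, scale=3.0, size=108)  # hypothetical 1900-2007 areal %
future = rng.gamma(shape=2.0, scale=4.0, size=31)     # hypothetical 2010-2040 areal %

t_stat, t_p = stats.ttest_ind(baseline, future, equal_var=False)
ks_stat, ks_p = stats.ks_2samp(baseline, future)

print(f"t-test:  t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"KS test: D = {ks_stat:.2f}, p = {ks_p:.3f}")
```

On skewed data with unbalanced sample sizes like these, the two tests can fall on opposite sides of the 5% level, which is the kind of marginal disagreement reported for NSW.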

I then asked for further clarification of how the statistical tests were performed,
and asked again for the data.

Further explanation of the statistical tests reveals that they consisted simply of a comparison of the means for two time periods, with the % area in each individual year treated as a single data point. This test assumes the data points are independent, but because of autocorrelation that assumption is unjustified. Failing to account for autocorrelation makes the test appear far more capable of detecting significant differences than it really is: the effective sample size is much smaller than the number of years, so the confidence intervals come out too narrow (see my post Scale Invariance for Dummies or Chapter 10 of my book). Also see the results of Breusch and Vahid 2008 from the Draft Garnaut Report (reviewed here), where t-test scores for the rate of temperature increase dropped from more than 4 to less than 2 when autocorrelation was taken into account.

I don’t know the exact autocorrelation, as I can’t get the data, but the temperature and rainfall variables from which the areal % is derived show very high autocorrelation (‘bursty’ behaviour), so these data must inherit that character.
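
To make the point concrete, here is a minimal sketch of the common lag-1 (AR(1)) adjustment to the effective sample size, applied to a synthetic persistent series standing in for an annual areal-% record. This is my own illustration of the general effect, not the procedure CSIRO used:

```python
# Illustration of how autocorrelation shrinks the effective sample size
# and widens the confidence interval on a mean. Uses the common lag-1
# adjustment n_eff = n * (1 - r1) / (1 + r1). Synthetic data only.
import numpy as np
from scipy import stats

def lag1_autocorr(x):
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.sum(x[:-1] * x[1:]) / np.sum(x * x)

def mean_ci(x, alpha=0.05, adjust=False):
    """95% confidence interval for the mean, optionally adjusted for lag-1 autocorrelation."""
    x = np.asarray(x, dtype=float)
    n_eff = float(len(x))
    if adjust:
        r1 = lag1_autocorr(x)
        n_eff = max(2.0, n_eff * (1 - r1) / (1 + r1))
    se = x.std(ddof=1) / np.sqrt(n_eff)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n_eff - 1)
    return x.mean() - t_crit * se, x.mean() + t_crit * se

# Synthetic persistent ("bursty") series, 108 annual values.
rng = np.random.default_rng(1)
x = np.zeros(108)
for i in range(1, len(x)):
    x[i] = 0.8 * x[i - 1] + rng.normal()

print("naive CI:   ", mean_ci(x, adjust=False))
print("adjusted CI:", mean_ci(x, adjust=True))  # noticeably wider
```

With persistence this strong, the adjusted interval is several times wider than the naive one, which is why an unadjusted comparison of means will overstate significance.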

Dear David,

Answers to your questions are embedded in the email below.

Regards

Kevin Hennessy



-----Original Message-----

From: David Stockwell [mailto:davids99us@gmail.com]

Sent: Tuesday, 15 July 2008 10:37 AM

To: Hennessy, Kevin (CMAR, Aspendale)

Subject: Re: Exceptional circumstances report supplementary information

Dear Kevin,

Thank you for relaying the description of the significance tests.
Just to be clear on what you did:

Was the standard deviation or the standard error of
the 13 model averages over the future periods used to determine
the significance of the decreasing areal % of rainfall and soil
moisture?

No. For rainfall and temperature, the tests were performed to assess
differences between means from two groups: group A (1900-2007) vs group
B (2010-2040) for temperature and rainfall. Thus, groups A and B had 108
and 31 data points, respectively. For soil moisture, group A (1957-2006)
and group B (50 yrs centred on 2030) both had 50 data points. We also
did a quick test of whether the results were sensitive to treating the
13 models separately or as multi-model means, and the answer is no.

Was the 108 yr time series for 1900-2007 one or 13 points?

One point (N = 108). See above.

Was the 31 yr time series for 2010-2040 13 points and used to get the
SD?

One point (N = 31). Yes.

Was a one or two-tailed t-test applied?

Both. They suggest the same conclusions, either above or below the 5%
significance level.

IMO the 13 values of model predictions are summaries of output that
are necessary to make a determination of the quality of evidence in
your report. Whether they are covered by IP or not, it would be in
the best interests of science for you to allow them to be used in an
independent check.

As you might be aware, using the minimal data provided in your report,
I determined that only one region showed significance for % area of
exceptionally low rainfall (SWWA) and only two areas showed
significance for soil moisture (SWWA and Vic&Tas).

http://landshape.org/enm/dought-exceptional-circumstances-review/

This is fewer than the 3 areas and 4 areas where you claim
significance (SW and NSW being the additional areas). I would like
to reconcile this difference and the quickest way would be for me to
check it with the data.

Thanks again for your cooperation up to this point, and I hope you
will reconsider your decision regarding the data.

As indicated in my previous email, we are not at liberty to distribute
the data due to IP limitations. We have checked our data and believe the
results and conclusions are correct.

  • Bernie

    David:
    Do they actually state the degrees of freedom in the t or F-test? Can you use that to figure out how much of an issue there is?

  • http://landshape.org/enm admin

    Bernie, I think not. One way of looking at it is that instead of the usual formula for the standard error of the mean of IID data:

    SE = σ / n^0.5

    the effective n is reduced by the Hurst exponent H, giving a standard error of:

    SE = σ / n^(1-H)

    The Hurst exponent is a measure of long-term persistence, or ‘trendiness’, and in climate data it is around 0.9.
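
A quick numeric check of the formula in the comment above, as a sketch with assumed values (σ = 1, n = 108, H = 0.9) rather than anything computed from the CSIRO data:

```python
# Compare the IID standard error with the long-range-dependence version
# SE = sigma / n**(1 - H), using illustrative values only.
sigma, n, H = 1.0, 108, 0.9
se_iid = sigma / n ** 0.5        # ~0.096
se_lrd = sigma / n ** (1 - H)    # ~0.63
print(se_iid, se_lrd, se_lrd / se_iid)  # ratio ~6.5: the CI is roughly 6.5x wider
```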

  • John Baltutis

    Maybe you should be asking the DOE’s Office of Science for the modeling dataset.

    From the report’s acknowledgements:

    We acknowledge the modelling groups, the Program for Climate Model Diagnosis and Intercomparison (PCMDI) and the
    WCRP’s Working Group on Coupled Modelling for their roles in making available the WCRP CMIP3 multi-model dataset.
    Support of this dataset is provided by the Office of Science, US Government Department of Energy.

  • Ian Castles

    David,

    I agree with your comment on Climate Audit that Kevin Hennessy could be acting on advice in invoking Intellectual Property Rights in order to avoid providing you with the data required to verify the statistical significance of the CSIRO findings ( http://www.climateaudit.org/?p=3267 )

    In an email of 16 March 2006 I advised Kevin Hennessy of an error in a paper of which he had been a co-author (Whetton et al, December 2005, “Australian climate change projections for impact assessment and policy application: a review”, CSIRO Marine and Atmospheric Research Paper 001). The paper stated that the IPCC scenarios were ‘deliberately constructed to be equally plausible’ (p. 33). I was pleased to receive Kevin’s response on 29 March in which he wrote “I agree with you [Castles]. The error will be corrected”. This email was copied to seven of Kevin’s CSIRO colleagues: Greg Ayers, Chris Mitchell, Penny Whetton, Ian Watterson, Kathleen McInnes, Paul Holper and Simon Torok.

    After more than two years, the error has still not been corrected. I think that the reason for CSIRO’s unwillingness to do this is that the paper in which the statement appeared had already been cited as “In prep.” in the First Order draft of the IPCC Report, circulated on 13 August 2005. This makes it difficult for the CSIRO to admit its error.

    On 11 October 2006, Roger Jones (another co-author of the CSIRO Research Paper, and another IPCC Coordinating Lead Author) replied as follows to an inquiry I’d made about CSIRO’s progress in correcting their error:

    “Your [Castles] point about the error in our paper made earlier this year was noted. A request was immediately sent by a colleague through to the relevant people to correct and repost the document after you first pointed this out but it was not done (a production task). On the basis of your recent post pointing out that the phrase had still not been corrected (quelle horreur) we have asked that it be followed through. There is no conspiracy – it was a breakdown in process.” (John Quiggin’s blog, “Drying Out” thread).

    It seems that either (a) the CSIRO editing process has continued to break down for the succeeding two years or (b) it is CSIRO’s policy not to correct errors. I think that the latter is more likely.

  • http://landshape.org/enm admin

    Ian, thanks for your background information here and at CA. It all points to the conclusion that working in the global warming area is about policy, litigation and quasi-litigation, and that the scientific mores we were fond of are relics of a bygone era.

  • david

    Am I right in thinking from this:

    ” For the frequency (temp and rain), the sample is the number of models (13)”

    that the statistics are being carried out on model predictions, not observations? If so, there’s no way this could be regarded as a random sample and a statistical test would be irrelevant. It would all depend on the systematic biases in the individual models (which are presumably deterministic).

  • http://landshape.org/enm admin

    Yes, they are model predictions. I contend even tests on the predictions are non-significant.

  • david

    But isn’t it risky (if the aim is to keep these guys honest about testing their models) to allow that such a test is even in principle valid? The way I see it is that it would be possible to pick one of the model parameters, say p, and run simulations such that a selected output (say, predicted temperature in 2050) is some single-valued function f(p) of p. Now, if I find out what statistical test you intend to do then I as the modeller can produce a set of “observations” f(p(i)) (model outputs) that can either pass or fail it, at will. In doing that I would obviously be open to major criticisms on other grounds (why those p(i)?) but given the current climate of debate I would easily get away with playing the card: you attacked me on statistical grounds and I won, and now you are trying a completely different tack on dynamical grounds. So when I win that, what next? Personal grounds? Typical Big Oil.

  • http://landshape.org/enm admin

    It is risky to trust models too much unless they have been thoroughly validated and are based on solid physical knowledge. Wouldn’t you be throwing out all models, in your view?

  • david

    No, models are fine by me in principle. It’s a question of how they are validated. It’s obvious to me (so obvious that I keep thinking I’m missing the point, so please tell me if I am) that each model must be validated independently. If model A predicts temperatures that are far too high and model B predicts temperatures that are far too low, we have two failed models. I don’t think it means anything at all if the average of the two happens to agree reasonably well with observations. To think it does implies some kind of mechanistic understanding about the overestimation of one model being “corrected” by the other. But if that’s the case, surely you would abandon models A and B and incorporate the “correction” to give model C, which is then tested against the observations. I think that doing a significance test is just an extension of that mistake.

  • http://landshape.org/enm admin

    I agree with you. The spread of the models mostly reflects uncertainty in the models themselves and in our knowledge of the system, not inherent physical variability. That said, there are results claiming that averaging or mixing models leads to improved prediction, but the distributional properties of statistics composed of mixtures of models are an active area of statistical research; see http://arxiv.org/abs/math/0702781.
