Announce: New fraud detection website

Detecting ‘massaging’ of data by human hands is an area of statistical analysis I have been working on for some time, and I devoted one chapter of my book, Niche Modeling, to its application to environmental data sets.

The WikiChecks web site now incorporates a script for doing a Benford’s analysis of digit frequency, sometimes used in numerical analysis of tax and other financial data.
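For readers unfamiliar with Benford's law, here is a minimal sketch of the first-digit frequencies it predicts, log10(1 + 1/d) for leading digit d. The helper names `benford_expected` and `first_digit` are my own for illustration and are not taken from the WikiChecks script.

```python
import math


def benford_expected():
    """Expected proportion of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading non-zero digit of a number, or None for zero.

    Scientific notation puts the leading digit first, e.g. 0.00123 -> '1.230000e-03'.
    Values extremely close to a power of ten may be rounded by the formatting.
    """
    if x == 0:
        return None
    return int(f"{abs(x):e}"[0])


for digit, p in benford_expected().items():
    print(digit, round(p, 3))
```

Under this law a leading 1 is expected about 30% of the time and a leading 9 less than 5% of the time, which is why fabricated figures with evenly spread leading digits tend to stand out.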

I have posted some initial tests on the site, using random numbers and the like. I also ran each of the major monthly global temperature indices through the site: GISS, RSS, UAH and CRU. The results are listed below, from lowest deviation to highest.

RSS – Pr<1
UAH – Pr<1 based on global data series; Pr<0.001 for whole file (see note)
GISS – Pr<0.05
CRU – Pr<0.01

Note: Missing-value codes in the UAH data (-99.990) may have caused the high deviation for the whole file. I don't know what might explain the others.
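One way to keep missing-value codes out of the digit counts is to drop them before tallying final digits. The sketch below assumes the values arrive as text tokens (so trailing zeros are preserved) and that -99.990 is the only sentinel. The function name and the input format are hypothetical, not the site's actual code.

```python
from collections import Counter

MISSING = {"-99.990"}  # sentinel seen in the UAH file; other files may use different codes


def final_digit_counts(tokens, missing=MISSING):
    """Count the last digit of each value, skipping missing-value sentinels."""
    counts = Counter({d: 0 for d in range(10)})
    for tok in tokens:
        tok = tok.strip()
        if tok in missing:
            continue  # do not let the sentinel inflate the count of zeros
        digits = [c for c in tok if c.isdigit()]
        if digits:
            counts[int(digits[-1])] += 1
    return counts


sample = ["0.12", "-99.990", "0.35", "-0.07"]
print(final_digit_counts(sample))
```

Working on the text rather than on parsed floats matters here: converting "0.350" to a number and back would silently drop the final zero.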

Table of results for GISS monthly global temperature data.

Frequency of each final digit: observed vs. expected

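| Digit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Totals |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Observed | 297 | 300 | 267 | 268 | 243 | 262 | 255 | 227 | 253 | 235 | 2607 |
| Expected | 260 | 260 | 260 | 260 | 260 | 260 | 260 | 260 | 260 | 260 | 2607 |
| Variance | 4.92 | 5.77 | 0.13 | 0.18 | 1.13 | 0.00 | 0.10 | 4.23 | 0.20 | 2.44 | 19.10 |
| Significant | * | * | | | | | | * | | | |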

| Statistic | DF | Obtained | Prob | Critical |
|---|---|---|---|---|
| Chi Square | 9 | 19.10 | <0.05 | 16.92 |

RESULT: Significant management detected.


Significant variation in digit 0: (Pr<0.05) indicates rounding up or down.
Significant variation in digit 1: (Pr<0.05) indicates management.
Significant variation in digit 7: (Pr<0.05) indicates management.
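The GISS figures above can be checked with a few lines of arithmetic. The sketch below compares the observed final-digit counts with a uniform expectation of 2607 / 10 = 260.7 per digit. Two things in it are assumptions rather than documented behaviour of the site: the published per-digit values appear to match a chi-square term with 0.5 subtracted from each absolute deviation (a continuity correction), and the per-digit flags appear to match the standard 5% cutoff of 3.84 for one degree of freedom.

```python
# Final-digit counts for GISS, copied from the results table (digits 0..9).
observed = [297, 300, 267, 268, 243, 262, 255, 227, 253, 235]

CRITICAL_DF9 = 16.92  # 5% critical value for 9 degrees of freedom, from the table
CRITICAL_DF1 = 3.84   # 5% critical value for 1 degree of freedom (assumed per-digit cutoff)


def digit_contributions(counts):
    """Per-digit chi-square terms against a uniform expectation.

    Assumption: 0.5 is subtracted from each absolute deviation,
    inferred from the published numbers rather than from the site's code.
    """
    expected = sum(counts) / len(counts)
    return [max(abs(o - expected) - 0.5, 0.0) ** 2 / expected for o in counts]


contributions = digit_contributions(observed)
chi_square = sum(contributions)
flagged = [d for d, c in enumerate(contributions) if c > CRITICAL_DF1]

print([round(c, 2) for c in contributions])
print(round(chi_square, 2), chi_square > CRITICAL_DF9)
print(flagged)
```

With these assumptions the per-digit terms and the total of 19.10 agree with the table to two decimals, and the flagged digits are 0, 1 and 7. Without the 0.5 correction the total comes out somewhat higher, but it is above the critical value of 16.92 either way.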

One of the main sources of global warming information, the GISS data set from NASA, showed significant management, particularly an excess of zeros and ones. Interestingly, the moving window mode of the algorithm identified two years, 1940 and 1968 (see here).
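A moving window analysis of this kind can be sketched as follows: slide a fixed-width window along the series, compute the final-digit chi-square inside each window, and report the windows that exceed the critical value. This is only an illustration of the idea. The window width, the 0.5 continuity correction, and the function names are my assumptions, and the site's moving window mode may work differently.

```python
def chi_square_uniform(digits):
    """Continuity-corrected chi-square of final digits against a uniform expectation."""
    expected = len(digits) / 10
    total = 0.0
    for d in range(10):
        observed = digits.count(d)
        total += max(abs(observed - expected) - 0.5, 0.0) ** 2 / expected
    return total


def flag_windows(labels, digits, width, critical=16.92):
    """Return (first_label, last_label, statistic) for each window above the critical value."""
    flagged = []
    for i in range(len(digits) - width + 1):
        stat = chi_square_uniform(digits[i:i + width])
        if stat > critical:
            flagged.append((labels[i], labels[i + width - 1], stat))
    return flagged


# Synthetic example: evenly spread final digits with a run of zeros inserted.
labels = list(range(1900, 2020))
digits = [i % 10 for i in range(120)]
digits[40:70] = [0] * 30
for first, last, stat in flag_windows(labels, digits, width=30):
    print(first, last, round(stat, 1))
```

Narrow windows contain few observations, so the small-sample caveat about the chi-square test applies with particular force to any window-level result.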

Considerable controversy has surrounded the 1940 period, related to possible adjustments for bucket sampling of water temperatures. I am not aware of any controversy surrounding the 1968 temperature measurements, although 1968 was a year marked by violent protests and the assassinations of Martin Luther King Jr. and Senator Robert Kennedy.

At this stage I am in exploratory mode. The chi-square test is prone to produce false positives for small samples. Also, there are a number of innocent reasons that digit frequency may diverge from expected. However, the tests are very sensitive. Even if arithmetic operations are performed on data after the manipulations, the ‘fingerprint’ of human intervention can remain.

Update:

Thanks to Luboš Motl, who checked this data, UAH was confirmed to be manipulation-free.