The following is a new version of the About Page. Gradually getting this website organized the way I want it.
I have always been fascinated with prediction.
As an undergraduate I made stock predictors on the first PCs and lost money in 1987.
Studied maths, statistics and started a PhD in ecological prediction.
Developed betting systems and lost money.
Studied algorithms for predicting species distributions and developed GARP which other people used for cool things like finding new species of chameleon in Madagascar.
Developed automated trading systems for FOREX in 2002 and lost money.
So I know a few things about prediction, and more about how not to do prediction. In this blog I hope to pass on a few of those lessons and help people predict better. Like predicting the risk to poultry from bird flu using GIS spatial analysis. Or monitoring the health of different types of hydrocoral polyps on reefs. The possibilities are endless.
Continue reading About Niche Modeling
The paper by S.J. Phillips, R.P. Anderson, and R.E. Schapire — A maximum entropy approach to species distribution modeling — introduces to niche modelers for the first time the maximum entropy approach well known in machine learning. They also provide the Maxent software for predicting species distributions, and evaluate it against a well-known method, DesktopGARP, in predicting the distribution of two Neotropical mammals, a sloth (Bradypus variegatus) and a rodent (Microryzomys minutus).
The Maxent principle is to estimate the probability distribution, such as the spatial distribution of a species, that is most spread out subject to constraints such as the known observations of the species. Maxent uses entropy as the means to generalize from specific observations of the presence of a species, and does not require or even incorporate absence points within the theoretical framework. This matters because presence-only points — records of where a species has been observed — are typically all that is available: for a variety of reasons, absence of a species is not usually recorded.
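The principle can be sketched with a toy example in R (a hypothetical illustration, not the Maxent software itself): over a finite set of grid cells, the most spread-out distribution whose expected value of an environmental variable matches a target constraint has an exponential form, with a single parameter that can be solved numerically.

```r
# Toy maximum entropy example: 10 grid cells with an environmental variable f
f <- seq(0, 1, length.out = 10)
target <- 0.7   # constraint: the expected value of f under p must be 0.7

# The maxent distribution under a mean constraint is p_i proportional to exp(lambda * f_i)
p.of <- function(lambda) { w <- exp(lambda * f); w / sum(w) }

# Solve for lambda so that the constraint is satisfied
lambda <- uniroot(function(l) sum(p.of(l) * f) - target,
                  c(-100, 100), tol = 1e-10)$root
p <- p.of(lambda)
sum(p * f)   # matches the target, 0.7
```

Without any constraint, maximum entropy gives the uniform distribution; the observations pull the distribution away from uniform only as far as the constraints require.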
Continue reading Phillips et al. Maxent
Finally, one journalist has the message right: Duane Freese, in his article "Hockey Stick Shortened?" at TechCentralStation, reports on the National Academy of Sciences report "Surface Temperature Reconstructions for the Last 2,000 Years". Repetition of the consensus view of strong evidence of recent global warming is not newsworthy. An increase in the uncertainty of the millennial temperature record is. He says:
The most gratifying thing about the National Academy of Science panel report last week into the science behind Michael Mann's past temperature reconstructions – the iconic "hockey stick" – isn't what the mainstream media have been reporting — the panel's declaration that the last 25 years of the 20th Century were the warmest in 400 years.
The hockey stick, in short, is 600 years shorter than it was before, and the uncertainties for previous centuries are larger than Mann allowed. And when the uncertainty of the paleoclimatological record increases with time, the uncertainty about the human contribution likewise increases. Why? For a reason noted on page 103 of the report: climate model simulations for future climates are tuned to the paleoclimatological proxy evidence of past climate change.
Continue reading Rings of Noise on Hockey Stick Graph
The trial of Hwang Woo Suk has begun, and it would be a good time to clarify some of the issues to be adjudicated.
Hwang was indicted on charges of fraud, embezzlement and bioethics violations. Hwang has admitted ethical lapses in human egg procurement for his research.
As to the embezzlement charges, prosecutors say Hwang used bank accounts held by relatives and subordinates in 2002 and 2003 to receive contributions from private organizations, then laundered the money by withdrawing it all in cash, breaking it into smaller amounts and depositing it back into various bank accounts. They claim he bought gifts for his sponsors, politicians and other prominent social figures, and bought a car for his wife.
But it is the fraud charges that are interesting. Prosecutors say they will take no separate action over the fabrication of data in Hwang's published research:
"There is no precedent in the world where someone has been punished for fabricating research results, and the matter should be left to academic mechanisms."
Continue reading Hwang Woo Suk Trial Sets Precedent for Academic Fraud
What is ‘results management’?
Accountants and auditors are often concerned with various kinds of alteration of figures, a practice euphemistically called 'earnings management'. For example, Mark Nigrini writes in "An Assessment of the Change in the Incidence of Earnings Management around the Enron-Andersen Episode":
In 2001 Enron filed amended financial statements, setting off a chain of events starting with its bankruptcy filing and including the conviction of Arthur Andersen for obstruction of justice. Earnings reports released in 2001 and 2002 were analyzed. The results showed that revenue numbers were subject to upwards management. Enron's reported numbers are reviewed and these show a strong tendency towards making financial thresholds.
Benford's law, a conjecture concerning the expected frequency of digits in unmanaged data, is useful for detecting fraud and other forms of results management. I have posted some results applied to time-series data here, and here. An R module used for analysing these time-series data is available from this site here.
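The module itself is linked above; the core idea can be sketched in a few lines of R by comparing the observed first-digit frequencies of a data set against Benford's expected proportions, log10(1 + 1/d):

```r
# Expected first-digit proportions under Benford's law
benford.expected <- log10(1 + 1/(1:9))

# First significant digit of each value in a positive numeric vector
first.digit <- function(x) as.numeric(substr(formatC(x, format = "e"), 1, 1))

# Observed versus expected first-digit proportions
benford.compare <- function(x) {
  obs <- tabulate(first.digit(x), nbins = 9) / length(x)
  data.frame(digit = 1:9, observed = obs, expected = benford.expected)
}

# Exponentially distributed data conform closely; managed or rounded
# figures typically show large departures in particular digits
set.seed(1)
benford.compare(rexp(10000))
```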
Continue reading Results Management
The NAS report has been chastised here and here for concluding that it is "plausible" that the "Northern Hemisphere was warmer during the last few decades of the 20th century than during any comparable period over the preceding millennium", while at the same time conceding that every statistical criticism of MBH is correct, disowning MBH claims to statistical skill for individual decades and years, and finding little confidence in reconstructions of surface temperatures from 1600 back to A.D. 900, and very little confidence in findings on average temperatures before then.
One of the main justifications for this plausibility was the “general consistency” of other studies.
The committee noted that scientists’ reconstructions of Northern Hemisphere surface temperatures for the past thousand years are generally consistent. The reconstructions show relatively warm conditions centered around the year 1000, and a relatively cold period, or “Little Ice Age,” from roughly 1500 to 1850. (NAS press release)
I want to show what a vacuous motherhood statement this is. Previously I have shown that virtually any data, including random data, will produce a graph similar to the existing studies if you 'cherry pick' proxies for correlations with temperature and then squint your eyes a bit: here, here, and here. Today I thought I would run a simple statistical test to see how consistent each of the reconstructions really is.
Below I have plotted each of the reconstructions and the CRU temperatures using a lag plot. A lag plot shows each value of a time series against the following value (lag 1). A lag plot allows easy discrimination of three main types of series:
- Random – shown as a cloud
- Autocorrelated – shown as a diagonal, and
- Periodic – shown as circles.
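R's built-in lag.plot function (in the stats package) produces exactly this plot, so the three signatures are easy to reproduce. The series below are synthetic examples, not the reconstructions themselves:

```r
# lag.plot draws each value of a series against its lag-1 neighbour
lag.plot(rnorm(500))                 # random: a structureless cloud
lag.plot(cumsum(rnorm(500)))         # autocorrelated: points hug the diagonal
lag.plot(sin(seq(0, 20 * pi, 0.1)))  # periodic: points trace a circle
```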
Millennial temperature reconstructions on a lag plot.
Continue reading 2000 Years Surface Temperatures Time Series Lag Plot
The play by Rolin Jones, "The Intelligent Design of Jenny Chow", is a tale of science fueled by post-adolescent angst, with a brilliant young woman who excels at rocket science but can't leave her bedroom. Driven by a real-life quest to find her biological mother, she pilfers parts from her government rocket project and builds a replica of herself, named Jenny Chow, to meet her birth mother in China by proxy.
The publicity photo for “The Intelligent Design of Jenny Chow,” an award-winning play by recent School of Drama graduate Rolin Jones.
Jenny's robot is a contemporary version of the simulacrum: a device created for a purpose, in a parody of reality. Such appears to be the case with the 'hockey stick graph' — a reconstruction of millennial global temperatures based on tree-ring and other proxies by MBH98. The graph, showing temperatures relatively stable throughout the middle ages and recent times, then shooting upward suddenly last century, has been an iconic feature of the IPCC 2001 report on climate change, and of countless government reports, slide shows, powerpoints and papers since its invention.
Continue reading The Intelligent Design of MBH98
The seasons are natural climate variations, the word 'season' deriving from the Latin satio, act of sowing, from satus, past participle of serere, to sow. Seasonings enhance the flavor of food, add zest to speeches, and make us competent through trials, as troops are seasoned by battle. The variability of our environment keeps life interesting.
Eagerly awaited is the release on Thursday, June 22 of the National Academy of Sciences report 'Surface Temperature Reconstructions for the Last 2,000 Years', describing and assessing the state of scientific efforts to reconstruct surface temperature records for the Earth over approximately the last 2,000 years based on tree rings, boreholes, ice cores, and other "proxy" evidence.
The Earth has seasons on an annual timescale. There are also seasons on a 100,000-year timescale, in which glaciations are punctuated by warmer periods such as the present. The NAS report addresses the controversial question: how seasonal is climate over the millennial timescale, the period called the late Holocene?
Continue reading Millennial Temperature Seasonality Index Report Due
Resources: Controversial Topics — a source of important new information on Ecological Niche Modeling, emerging from nightly searches for new images, web pages, videos, news articles and scientific articles.
Downloads: Peer-reviewed PDFs on niche modelling theory, practice, biodiversity and information infrastructure.
David R.B. Stockwell, Improving ecological niche models by data mining large environmental datasets for surrogate models, Ecological Modelling 192 (2006) 188–196. PDF
Continue reading Ecological Niche PDFs
Summary: Predictive modelling shows that the response of the BTS is an asymptotic function of precipitation, with no real upper limit. It would therefore be a threat to all tropical areas, no matter how high the rainfall.
Resources: Controversial Topics — a source of important new information on the Brown Tree Snake, emerging from nightly searches for new images, web pages, videos, news articles and scientific articles.
The Brown Tree Snake (BTS) is a significant threat to many native species where it has invaded. There are a large number of ongoing efforts to prevent its movement into Florida, Hawaii, Texas, and other potential new habitats.
If brown tree snakes come here from Guam, they would cause "the greatest catastrophe of the century," a federal official says.
The BTS (family Colubridae) belongs to the genus Boiga, a group of about 25 species referred to as "cat-eyed" snakes due to their vertical pupils. They are rear-fanged, have a large head in relation to the body, and are brownish or greenish, sometimes with faint bands. Adults are generally 4-5 feet long. The BTS can survive for extended periods without food, a trait that enables it to survive in ship-bound and stored cargo for long periods.
The BTS now occurs beyond its native range. It was introduced to Guam during World War II with a single female snake, and spread to all parts of Guam by the late 1960s. It has since been shown to be responsible for a drastic decline in the numbers of all native birds on the island.
BTS predictive model
Continue reading Brown Tree Snake (Boiga irregularis) — predicting the extent of the threat.
Avian flu (also "bird flu", "avian influenza", etc.) is influenza caused by Influenza A viruses that have adapted to birds. As of 2006, "avian flu" refers to a particular subtype of Influenza A, H5N1, currently the world's major flu pandemic threat.
“Though human-to-human transmission of avian flu still has not been confirmed scientifically, you need to take precautions while covering the issue in the field,â€
Professor Luhur Suroso.
While human-to-human transmission has not been confirmed (although suspected in the death of a family in Indonesia) the mortality rate for humans who do contract it is high, around 50%. The highly pathogenic form spreads rapidly through flocks of poultry. The disease has a mortality rate that can reach 90-100% in poultry, often within 48 hours. Because of this the disease is of immediate concern to the poultry industry and to conservators of native birdlife.
Continue reading GIS, Avian Influenza and Data Hoarding
Stung by a string of controversies, corrections and frauds, and inspired, we hope, by the work of science bloggers in reinstating a culture of broad scientific debate, Nature magazine has instituted what it calls a 'Peer Review Trial'.
In Nature’s peer review trial, lasting for three months, authors can choose to have their submissions posted on a preprint server for open comments, in parallel with the conventional peer review process. Anyone in the field may then post comments, provided they are prepared to identify themselves.
Looks like a blog, sounds like a blog, is a blog!
Some very interesting articles critical of existing peer review in the forum raise points already made on this blog here, and here. Here is a list of topics presently in the forum.
- An open, two-stage peer-review journal.
- Reviving a culture of scientific debate.
- Can ‘open peer review’ work for biologists?
- Researchers need reviewers to check their stats.
- Analysing the purpose of peer review.
- What authors, editors and reviewers should do to improve peer review.
- Wisdom of the crowds.
- Scientific publishers should let their online readers become reviewers.
- Certification in a digital era.
- The pros and cons of open peer review.
- Should authors be told who their reviewers are?
- Even reviewed literature can be cherry-picked to support any argument.
I wish them all the best, but credit should go where it is due. Without science bloggers and participants donating their time to finding flaws in existing peer-reviewed articles — such examples in Nature as the takedown of South Korean human stem-cell researcher Hwang, and McIntyre's corrections of Mann's false statistics supporting the hockey-stick graph of climate history — the process of reform would not have begun.
There is much to be concerned about in the level of verification and due diligence in climate science, biodiversity and other areas, and Nature is taking some positive steps, drawing on the wisdom of the blogosphere.
Here is another prediction quiz, again suggested by Demetris Koutsoyiannis, a little different to the one that challenged readers here. In this case the quiz is not to guess the underlying model (the exact solution) but to find an assumed model or technique of any type that can give good predictions based on the past statistical behaviour, autocorrelation, or perhaps reconstructed dynamics (in the case of ANN or chaotic nonlinear methods).
The time series given in the text file neomail.txt (500 values) was generated by a mathematical model. The series is characterized by strong autocorrelation and perhaps long-term persistence. At this time the model type is not disclosed, but it uses a single deterministic algorithm whose application can be continued to give at least 50 more data values (the "true" values). These will be disclosed, along with the model, at the end of the quiz. The challenge: can you accurately predict the next 50 values?
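One baseline entry for such a quiz (by no means the best approach for a series with long-term persistence) is to fit an autoregressive model to the 500 known values and iterate it forward. A sketch in R, using a stand-in random-walk series in place of the neomail.txt data:

```r
# Stand-in for the quiz series; read the real one with x <- scan("neomail.txt")
set.seed(1)
x <- cumsum(rnorm(500))   # strongly autocorrelated stand-in series

# Fit an AR model with the order chosen by AIC, then forecast 50 steps ahead
fit <- ar(x)
pred <- predict(fit, n.ahead = 50)
pred$pred   # 50 point forecasts
pred$se     # their standard errors, which widen with the horizon
```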
Continue reading A Challenge to Prediction Experts
Predictive models for business analytics often involve
very complex data-mining and other statistical techniques.
Here is a simple, efficient way of predicting using images that reduces the
prediction process to its bare essentials.
All models are essentially generalizations —
simplifications into patterns that enable extrapolation into the unknown.
As such, one of the simplest forms of generalization is categorization, where a large number of dissimilar items are sorted into a smaller number of bins based on their similarity. Once a set of bins or categories is established, and there is a basis for deciding into which bin new items should go, new items can be categorized. In this way, a categorization, or clustering, can serve as a predictive model. And, as categorization is the basic operation producing a color palette for an image, images can be used to develop models, and palette swaps used for prediction.
Figure 1. A colored image, before and after categorizing into five colors.
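The categorization step can be sketched with k-means clustering, a standard algorithm for building a reduced color palette (a minimal sketch on simulated pixels, not necessarily the exact method used for Figure 1). Each item is an RGB triple, the five cluster centres form the palette, and 'prediction' is assigning a new pixel to its nearest palette colour:

```r
# Simulated image pixels: 1000 RGB triples with components in [0, 1]
set.seed(1)
pixels <- matrix(runif(3000), ncol = 3,
                 dimnames = list(NULL, c("r", "g", "b")))

# Categorize the pixels into a 5-colour palette
km <- kmeans(pixels, centers = 5)
pal <- km$centers   # the five representative colours

# Predict: assign a new pixel to the nearest palette colour
new.pixel <- c(0.2, 0.8, 0.4)
d <- colSums((t(pal) - new.pixel)^2)
pal[which.min(d), ]
```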
Continue reading Efficient Prediction by Palette Swapping
The conventional approach to information systems is 'three-tiered', consisting of a database, an analysis system and a presentation system. For example, a low-cost system for predictive business analytics might use a MySQL database, the statistics language R for analytics, and a browser for presentation of results.
While there are advantages in partitioning a system into clearly delineated levels, inefficiencies and errors are introduced when transferring data between levels, particularly in the reading and writing of data to databases. A novel approach, producing very efficient systems for financial analysis, is to integrate database and analysis functions into one memory-resident application. Using this approach, the vector-based K language has produced the fastest industry trading applications in existence — over 100 times faster than equivalent three-tiered applications.
In this section we show how to use R as a database by replicating relational database operations, including select and join. As well as offering simplicity and efficiency for smaller systems, the simple customer example helps to build knowledge of R's powerful indexing operations.
Continue reading Novel Product Database Using Vectors
Business decision-making that uses information such as customer profiles to predict returns on investment is coming to be known as the Predictive Enterprise (PE). While business analytics — and 'quants' — are not new, engineering a greater degree of sophistication and integration of predictive modeling into business processes and structures is showing high returns in many industries.
A range of software tools for the predictive enterprise are beginning to be available. Well known vendors like SPSS appear very serious about enabling the PE with off-the-shelf business solutions for education, financial, marketing, insurance and telecommunications industries. A new book “The Power to Predict” tells how stepping beyond the real-time to anticipate trends is changing the way web stats are analysed, software is deployed, and information systems are designed.
Successful predictive analytics relies heavily on a few basic concepts in mathematics and statistics. Off-the-shelf statistical packages hide the complexity, and can be useful if they are tailored exactly to your application. But in many cases each situation is a little different, and an understanding of the basics and the pitfalls is necessary for communicating the results with confidence. Niche modeling, as developed in ecology, has produced sophisticated tools and treatments directly applicable to predictive analytics for business. This post describes how predictive analytics can be expressed in terms of niche modeling.
The analysis is done via the R language, a powerful, reliable and free statistical program in the manner of the S statistical language.
Using R for predictive analytics is a low-cost and flexible solution, but it does require a basic knowledge of statistics and mathematics. R is a very powerful language for a number of reasons, but its main feature is vector processing — the ability to perform operations on entire arrays of numbers without explicitly writing iteration loops. This allows code to be shortened considerably, implements loops efficiently, and encourages a parsimonious style of programming around larger data structures that is easier to maintain.
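For example, squaring a series and smoothing it with a moving average need no explicit loops (filter here is the vectorized stats::filter, one of several ways to write a moving average):

```r
x <- 1:10

# Elementwise arithmetic applies to the whole vector at once
y <- x^2 + 1

# A 3-point moving average via the vectorized filter() function
ma3 <- filter(x, rep(1/3, 3), sides = 2)
ma3   # NA at each end, then 2, 3, ..., 9
```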
For people not familiar with the R language, a summary comparison of the major types and operations here provides a rapid tutorial in using the R language for niche modeling.
Continue reading Better Predictive Analytics Using Niche Modeling
Hans Erren brought to my notice a study he did on artifacts in the Quelccaya ice core record, first observed by Steve McIntyre (of climateaudit.org). I thought I would run this through benford() to see if it could pick up the problem. The study is here at Hans’ home site and the dataset is here at the NOAA data archive.
Here is the original figure, with the artifact shown as fanned-out lines.
According to Hans' study, the artifact is caused by rounding up of the accumulation results to whole centimeters. This form of 'results management' reduces the degrees of freedom of the results, essentially collapsing real numbers into a more limited integer domain, as is clearly seen in Hans' figure.
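A direct check for this kind of rounding, independent of benford(), is to ask what fraction of values fall exactly on the coarser grid. A sketch on hypothetical accumulation values in metres (real, unrounded measurements would land on whole centimetres only rarely):

```r
# Hypothetical accumulation values (metres), rounded to whole centimetres
set.seed(1)
x <- round(runif(100, 0.5, 2.5), 2)

# Fraction of values that are exact multiples of 0.01 m
frac.cm <- mean(abs(x * 100 - round(x * 100)) < 1e-8)
frac.cm   # 1 for this rounded series; unrounded data give roughly 0
```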
Continue reading Detecting Rounding of Results