## Examples of Ecological Niche Models from GARP

GARP is an acronym for Genetic Algorithm for Rule Set Production. It is an algorithm designed primarily for predicting the potential distribution of biological entities from raster-based environmental and biological data. This post describes examples of the interpretation of different sets of rules developed by GARP.

## Abundance of Greater Glider

The Greater Glider (Petauroides volans) is a species of gliding possum found extensively in old-growth forest regions of south-eastern Australia. It nests in hollows created by the broken limbs of eucalyptus trees, and feeds on the leaves of a variety of eucalyptus species. The species is of interest for conservation because its presence is an indicator of the presence of a suite of arboreal marsupial species.

The Waratah Creek data set is a mapping of an area 1600 ha in extent, in a 20×20 grid, located at Waratah Ck. It contains eight data layers. The first is the density of Greater Gliders at four levels, while the remaining layers are forest inventory variables known to be relevant to possum density. The data set, and a comparison of the performance of a number of other Artificial Intelligence methods on it, is described in Stockwell et al. (1990). The variables are shown in row-column order in Figure 1.

Figure 1. The variables in the Waratah Creek data set, in row-column order, are GG density, Dev development (road corridors, pine plantations), StC stream corridor (proximity), SdC stand condition (merchantable timber), StQ site quality (productivity), FlN floristic nutrients (based on vegetation types), Slp slope, and Ero erosion potential. Dark squares are low values, and lighter squares are higher values.

The data set is a useful small test data set for comparing predictive algorithms, and is included in the distribution of the GARP program. It is particularly useful for this purpose because it contains complex combinations of ecological relationships.

## Bayesian Networks

The problem with many models, from climate systems to multiple species and ecosystem processes to consumer purchasing behaviour, is that we often have very little understanding of the actual relationships between the variables in the system.

From our limited vantage point as observers of, and not experimenters on, systems, we see only many weakly correlated variables, often drawn from incomplete samples and widely ranging sources.

We need an automated method of developing structure from the given data that explicitly quantifies our belief that a model captures the behaviour of the system. Bayesian nets, belief nets or graphical models begin to do this by assigning a level of belief to each of the possible values of the parameters. That is, while a conventional simulation of climate, say, has at most one value for each parameter at each step, a Bayesian network would represent the distribution of possible values for each parameter at each point in time.
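As a minimal sketch of what "a level of belief for each possible value" can look like, the R code below treats a single node of such a network: it holds a discrete distribution over one hypothetical parameter and updates it with Bayes' rule after one observation. The candidate values, the flat prior, the observed value and the Gaussian error of 0.5 are all illustrative assumptions, not taken from any data set discussed here.

```r
# Candidate values of a hypothetical parameter (assumed for illustration),
# e.g. a warming trend in degrees per century
trend <- c(0.0, 0.5, 1.0, 1.5, 2.0)

# Prior belief: equal weight on every candidate value
prior <- rep(1 / length(trend), length(trend))

# One hypothetical observation with an assumed Gaussian measurement error
observed <- 1.2
likelihood <- dnorm(observed, mean = trend, sd = 0.5)

# Bayes' rule: the posterior is proportional to prior times likelihood
posterior <- prior * likelihood / sum(prior * likelihood)

# A conventional model would report one value; the belief representation
# keeps the whole distribution
point_estimate <- trend[which.max(posterior)]
print(data.frame(trend, prior, posterior = round(posterior, 3)))
```

A full belief net would chain many such nodes together through conditional probability tables; this sketch only shows the representation of one of them.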

Belief net construction can involve a manual process of knowledge engineering. Examples of systems for graphically structuring models are the Ptolemy project for modeling, simulation, and design of concurrent, real-time, embedded systems, or the freely available ‘scientific workflow’ tool called Kepler where the flow of data from one analytical step to another is captured in a formal workflow language.
Recent advances in machine learning and data mining have also yielded efficient methods for creating belief nets directly from data (Cooper and Herskovits, 1992).

## Quantitative Niche Market Research

Niche marketing is the process of finding and serving small but potentially profitable market segments. These small market segments can be visualized as part of a “long tail”, a term elaborated by Chris Anderson in his longtail blog.

Niche markets are important for small businesses, as they can find it profitable to serve markets too small for mainstream businesses. Anderson argues that products with low sales volume can collectively exceed the relatively few current bestsellers. An example is weblogs: a relative handful have many links going into them, but “the long tail” of millions of weblogs may have only a handful of links each.
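The long-tail argument can be illustrated with a few lines of R. This is only a sketch under stated assumptions: the catalogue of 100,000 products, the power law with sales proportional to 1/rank, and the cut-off of 100 "bestsellers" are all made up. Whether the tail really exceeds the head depends entirely on those choices.

```r
# Hypothetical catalogue ranked by popularity, with sales following an
# assumed power law: sales proportional to 1 / rank
ranks <- 1:100000
sales <- 1 / ranks

head_n <- 100                               # the assumed "bestsellers"
head_sales <- sum(sales[ranks <= head_n])   # total for the head
tail_sales <- sum(sales[ranks > head_n])    # total for the long tail

cat("Top", head_n, "products:", round(head_sales, 2), "\n")
cat("Remaining products:   ", round(tail_sales, 2), "\n")
cat("Tail exceeds head:", tail_sales > head_sales, "\n")
```

Shrinking the catalogue to 10,000 products with the same cut-off reverses the comparison, which is a useful reminder that the claim is about the shape and length of the distribution, not a general law.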

## Writing a Book Using R

Think faster than “How to write a book in 28 days”. With the freely available R language you can create a book in less than 28 seconds. Unfortunately, you still have to write the text and do the programming. What you can do is integrate the R code and text into the same files, then generate the figures and LaTeX text together. This adds a lot of flexibility and organization for highly technical productions, and avoids the hassle of cross-referencing.

In my book, “Niche Modeling”, which has finally been sent to the publishers, I incorporated many tables and figures of results on circularity and reconstructions published here over the last 6 months, almost all generated on the fly from R data structures using Sweave and xtable. A push of a button runs all the R scripts for generating plots, tables, and outputting LaTeX.

There were some technical issues and last-minute formatting glitches, though, that I want to document here for posterity.

## Organizing

The starting point for organizing a multipart book was this short note — LaTeX Files for a Book or Thesis. However, I put all the 12 chapters in subdirectories, and included the chapters from a master.tex file in the parent directory. I also used a single chapter.tex file, so I could work on one chapter at a time easily.

## Sweave

Sweave is a tool distributed with R that allows ‘literate programming’, or integrating code and documentation. For example, code blocks are included in the LaTeX as shown below. The figure is then referred to in the text as Figure~\ref{fig1}. On running Sweave, the figure is generated as a postscript file, and the appropriate LaTeX for inserting and referencing it in the document is added. This saves a lot of annoying cross-referencing.

\begin{figure}
<<fig=TRUE>>=
... R code ...
@
\caption{This is a figure}
\label{fig1}
\end{figure}


I also needed to include an option to write figures to another directory
to keep them from cluttering up the chapter directories.

\SweaveOpts{prefix.string=../figs/chap12}


Sweave is run from within R with the command Sweave("script.R").

Here is where the problems started. The publisher required all fonts to be embedded in the pdf file, including those in the figures. The default font in R figures is Helvetica, which is not available for embedding by the LaTeX compiler. I had to use ps.options(family="NimbusSan") to specify another font.
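Putting the pieces above together, a minimal sketch of a Sweave run with a non-default figure font might look like this. The file name chapter.Rnw, its contents and the temporary directory are made up for the example; only Sweave() and ps.options(family = "NimbusSan") come from the text. The chunk options eps=TRUE and pdf=FALSE are an assumption for current versions of R, where postscript output has to be requested explicitly.

```r
# Write a tiny Sweave source file to a temporary directory
# (file name and contents are invented for this example)
workdir <- tempfile("sweave_demo")
dir.create(workdir)
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "\\begin{figure}",
  "<<fig=TRUE, echo=FALSE, eps=TRUE, pdf=FALSE>>=",
  "plot(1:10, main = 'Example')",
  "@",
  "\\caption{This is a figure}",
  "\\label{fig1}",
  "\\end{figure}",
  "\\end{document}"
), file.path(workdir, "chapter.Rnw"))

# Use a font family whose metrics ship with R instead of Helvetica
ps.options(family = "NimbusSan")

# Sweave writes the .tex file and the figure into the working directory
oldwd <- setwd(workdir)
Sweave("chapter.Rnw")
setwd(oldwd)

# The directory now holds the source, the generated LaTeX and the figure
list.files(workdir)
```

The generated chapter.tex still has to be compiled with LaTeX, and the fonts in the final pdf still need to be checked as described in the next section.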

## Embed fonts with ghostscript

Normally I used iTeXMac for compiling LaTeX files. For final preparation I also had to compile the document and then ensure all the fonts were embedded, using the following ghostscript command.

pdflatex master
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=master2.pdf master.pdf


To check that the fonts are embedded, open the file in Acrobat and look at “document properties” under “fonts”. All the fonts should say “Embedded Subsetted”. After doing this there was still a single font not embedded, called R1002 or R1004 depending on how I compiled it, and I could not find any information about it. The publisher’s technical person found it was due to a single apostrophe in a code listing! Something to watch out for.

The publishers also required that I use their style file. This style uses the LaTeX directive \tabletitle{} instead of \caption{} for table captions. As I was using the R package xtable to generate tables, I couldn’t change the captions by hand. xtable is really useful, producing nicely formatted LaTeX for R data structures like data frames, model output and time series. But I had to change the xtable code where it writes \caption{ to write \tabletitle{ instead, and also set it to write captions at the top of the table block by default, not below.
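An alternative to editing the xtable source, sketched here as an assumption rather than as what I actually did, is to capture the LaTeX that xtable produces and substitute the caption command afterwards. The example uses the built-in cars data set and an invented caption; caption.placement and print.results are arguments of print.xtable in current versions of the package.

```r
library(xtable)   # CRAN package, must be installed separately

# Build a table object from a built-in data set with an invented caption
tab <- xtable(head(cars), caption = "Speed and stopping distance")

# Put the caption above the table and capture the LaTeX instead of printing it
tex <- print(tab, caption.placement = "top", print.results = FALSE)

# Swap the caption command for the one the publisher's style expects
tex <- gsub("\\caption{", "\\tabletitle{", tex, fixed = TRUE)

cat(tex)
```

Because the substitution is a plain string replacement, it would also change any \caption{ that appears inside the table body, so it is only safe for simple tables.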

Another issue was that code listings in R would exceed the page width. I found that reducing the width of the console window also shortens the lines of output written to the LaTeX files.
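The same effect can be had without resizing a window by setting the width option in R directly. The sketch below uses an arbitrary width of 60 characters and a made-up vector to show that printed output wraps within the limit.

```r
# Narrow the output width so printed results wrap before the page margin
old <- options(width = 60)

# A made-up vector that would not fit on one line
x <- seq(1000, 40000, by = 1000)
out <- capture.output(print(x))

print(out)
max(nchar(out))   # length of the longest printed line

# Restore the previous setting
options(old)
```

Placing the options(width = ...) call in an early, hidden code chunk of the Sweave file applies it to every listing that follows.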

That is about it for the moment. I wish I had another 3 months to fiddle with the figures and explain things more. But I have to get it in or it won’t be published this year.

## Multiple Lines of Plausible Evidence

The process behind the Mann “hockey stick” paper, featured by Al Gore in his movie “An Inconvenient Truth”, has been damned by the Wegman Report to the US Congress. But how much difference should this make to belief in global warming?

How central is the work of Drs. Mann, Bradley, and Hughes to the consensus on the temperature record?

Ans: MBH98/99 has been politicized by the IPCC and other public forums and has generated an unfortunate level of consensus in the public and political sectors and has been accepted to a large extent as truth. Within the scholarly community and in certain conservative sectors of the popular press, there is at least some level of skepticism.

How could we answer this question quantitatively?

Consider the claim of unanimous certainty about Anthropogenic Global Warming (AGW) made by Naomi Oreskes: that “the scientific community believes that the Earth is warming and that human activities are the principal cause”.

Yet the IPCC claims AGW is “likely”. The IPCC defines the terms “likely” (meaning 66-90%) and “very likely” (meaning >90%). Under this definition, AGW would not rise to the level of ‘significant’ in a scientific test, which is usually based on a threshold of 95% confidence. Classical science operates through such definitive high-confidence beliefs. In this way the possibility of being wrong in a claim is reduced to a very low probability, an event in the ‘long tail’ of the probability distribution.

On listening to the general arguments in the Wegman hearing it seemed that the scientific consensus for anomalous warming of the 20th century comes by accumulating multiple lines of plausible evidence (e.g. glaciers, models, proxies, biota, etc).

Application of simple probability theory leads to some interesting observations about how multiple lines of low-confidence evidence stack up against a single high-confidence claim.
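A small calculation shows why the way the lines of evidence are combined matters. This is a sketch under loud assumptions: each line of evidence is taken to be correct with the same probability of 0.7 (a value inside the “likely” band, chosen only for illustration), and the lines are treated as statistically independent, which real glacier, model and proxy evidence may well not be.

```r
# Assumed probability that any single line of evidence is correct
p <- 0.7
n <- 1:6   # number of independent lines of evidence

# If the conclusion needs every line to hold, confidence falls with each line
all_correct <- p^n

# If any one line would be enough, confidence rises with each line
any_correct <- 1 - (1 - p)^n

print(data.frame(lines = n,
                 all_correct = round(all_correct, 3),
                 any_correct = round(any_correct, 3)))

# Smallest number of lines for "at least one correct" to pass 95%
lines_needed <- min(n[any_correct > 0.95])
lines_needed
```

Under these assumptions three weak lines are enough to pass 95% if any one of them suffices, while a conclusion that needs all of them to hold never gets above the confidence of the weakest single line.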

## Hwang Woo Suk, Blogs, and Mann

I’ll make it clear from the outset that the falsification of data by Hwang Woo Suk and the flawed results of Mann, Bradley and Hughes are completely different situations. There are some similarities though. One seemed to open a door to therapeutic cloning that could benefit millions of people with debilitating illnesses such as Alzheimer’s and Parkinson’s disease. The other provided a breakthrough view of millennial climate history that seemed to prove humans were altering climate in an unprecedented way. They both have thousands of supporters wanting to give them a chance to prove the findings correct. And the problems were both discovered by scrutiny outside the peer review process.

At the hearings, “Questions Surrounding the ‘Hockey Stick’ Temperature Studies: Implications for Climate Change Assessments”, the question not asked directly was: how many other studies have been subjected to independent scrutiny? Certainly none by the IPCC, which only conducts a literature review. To my knowledge no other climate studies have been audited as McIntyre and McKitrick did for the hockey stick. Yet this scrutiny is probably less than might be done in the evaluation of major engineering projects, ore body reserves or any significant business venture. The hockey stick was the one climate study among many to be subjected to auditing in depth, and it was found wanting.

According to the Wall Street Journal editorial, a semiretired Toronto minerals consultant and an economist, with about $5,000 of his own money and time, took on an apparently simple task (trying to double-check the influential graphic known as the “hockey stick”) and eventually confronted an influential scientific community before a Congressional Committee, and won. What was the sequence of events?

1990, MWP: Based on numerous anecdotal studies, the Intergovernmental Panel on Climate Change (IPCC) in 1990 included a schematic view of the past 1000 years in which there was a period of elevated temperatures known as the Medieval Warm Period, which was followed by the Little Ice Age, and then a new period of global warming.

Alternative options for temperature history. IPCC 1990 Figure 7.1.c (red), MBH 1999 40 year average used in IPCC TAR 2001 (blue), and Moberg et al 2005 low frequency signal (black), from Wikipedia.

1998, MBH98: Mann, Bradley and Hughes published a quantitative study using a new climate field methodology, showing temperatures as a hockey stick shape. It eliminated the Medieval Warm Period, flattening the fluctuations in global temperatures over most of the past millennium (the handle of the hockey stick) until we get to the 20th century, where the rate of global warming takes off in a sharp upward surge (the blade of the hockey stick).

## Avian Influenza No Spatial Change

The WHO Epidemic and Pandemic Alert and Response reports no change in the spatial distribution of human cases of H5N1 avian influenza, although the number of cases occurring in Indonesia is consistent with linear projections. Below are the most recent cases from affected countries.

Azerbaijan, 11 April 2006: The case is a 17-year-old girl who developed symptoms on 11 March. She was seriously ill with bilateral pneumonia but has since fully recovered and been discharged from hospital.
Cambodia, 6 April 2006: The Ministry of Health in Cambodia has confirmed the country’s sixth case of human infection with the H5N1 avian influenza virus. The case occurred in a 12-year-old boy from the south-eastern province of Prey Veng, which borders Viet Nam.

China, 16 June 2006: The Ministry of Health in China has confirmed the country’s 19th case of human infection with the H5N1 avian influenza virus. The patient is a 31-year-old man employed as a truck driver in Shenzhen City, Guangdong Province, near the border with Hong Kong.

Djibouti, 12 May 2006: The Ministry of Health in Djibouti has confirmed the country’s first case of human infection with the H5N1 avian influenza virus. The patient is a 2-year-old girl from a small rural village in Arta district. She developed symptoms on 23 April. She is presently in a stable condition with persistent symptoms.

Egypt, 5 May 2006: The Ministry of Health in Egypt has announced the country’s 5th death from H5N1 avian influenza. The death occurred in a previously reported case, a 27-year-old woman from Cairo. She was hospitalized on 1 May and died on 4 May.

Indonesia, 14 July 2006: The Ministry of Health in Indonesia has confirmed the country’s 53rd case of human infection with the H5N1 avian influenza virus. The case, which was fatal, occurred in a 3-year-old girl from a suburb of Jakarta.

Iraq, 1 March 2006: A WHO collaborating laboratory in the United Kingdom has verified H5N1 avian influenza as the cause of death in a 39-year-old Iraqi man, previously announced by the Ministry of Health.

Thailand, 9 December 2005: The Ministry of Public Health in Thailand has confirmed a further case of human infection with the H5N1 avian influenza virus. The case occurred in a 5-year-old boy, who developed symptoms on 25 November, was hospitalized on 5 December, and died on 7 December. The child resided in the central province of Nakhonnayok.
Turkey, 30 January 2006: A WHO collaborating laboratory in the United Kingdom has now confirmed 12 of the 21 cases of H5N1 avian influenza previously announced by the Turkish Ministry of Health. All four fatalities are among the 12 confirmed cases.

Viet Nam, 25 November 2005: The Ministry of Health in Viet Nam has confirmed a further case of human infection with H5N1 avian influenza. The case is a 15-year-old boy from Hai Phong Province. He developed symptoms on 14 November and was hospitalized on 16 November. He has been discharged from hospital and is recovering.

## When r² Regression is Very High

The coefficient of determination, or r², is the ratio of explained variation to total variation in a regression of Y on X, and ranges from 0 to 1. The underlying correlation coefficient r ranges from −1 to 1, where a value of 1 is an exact, positive linear relationship, with all data points lying on the same line, and a value of 0 shows no linear relationship between the variables. If r² is high, most people assume X and Y are related. That may be so if the errors are “normal”, or independent and identically distributed. But r² may also be high when X is not related to Y, because natural series can have “spurious significance”. It is hard to do better than this set of articles on high correlation statistics from Steve McIntyre at ClimateAudit to explain “spurious significance”.

## Simple Linear Regression of Rainfall

From forecasting the onset of the Monsoon in India, to mapping the drought in the USA and the worsening drought in Australia, insight from statistics into rainfall patterns affects everyone. The following documents contain technical information on regression models and rainfall, organized from the simplest linear regression to the more technical local spatial and temporal fitting techniques.

Statistics 301 Handout #30: Simple Linear Regression contains R code with simple regression exercises. Partial Correlation Coefficients by Gerard E. Dallal, Ph.D.
provides some climate-related examples of scatterplots and partial correlation coefficients.

More advanced approaches to the reconstruction of rainfall fields use a form of local regression. Here a partial thin plate smoothing spline is used for Spatial Modelling of Climatic Variables on a Continental Scale. Here is another good example, where local fitting techniques have been used for Estimation of Precipitation by Kriging in the EOF Space of the Sea Level Pressure Field. Some of the most advanced and insightful work on rainfall modeling is by Koutsoyiannis, e.g. An entropic-stochastic representation of rainfall intermittency: The origin of clustering and persistence, Water Resources Research, 42(1), W01401, 2006.

Here is the technical documentation of the software SPLINA and SPLINB. The additive regression model appears to be a practical option for analysing spatially varying effects of several predictors on observed phenomena. It is attractive from the point of view of overcoming curse of dimension problems associated with the analysis of noisy multivariate data. Moreover its implementation is a straightforward extension of standard thin plate spline methodology.

## Intelligent Design of Jenny Chow: Act II

Just after the release of the National Academy of Sciences report ‘Surface Temperature Reconstructions for the Last 2,000 Years’, I compared the ‘hockey stick graph’, a reconstruction of millennial global temperatures based on tree-ring and other proxies by Mann and colleagues, to a Broadway production in The Intelligent Design of MBH98. Statistics packages like R are fun for ‘doing it yourself’ and getting involved in the debates. However, for critical policy-oriented work on climate I strongly support the main findings of the recently released report “Ad Hoc Committee Report on the ‘Hockey Stick’ Global Climate Reconstruction”, aka the Wegman Report, namely:

• Independent statistical expertise should be sought and used.
The report states on page 17, under the subsection discussing the background to principal components analysis (PCA), that:

In reality, temperature records and hence data derived from proxies are not modeled accurately by a trend with superimposed noise that is either red or white. There are complex feedback mechanisms and nonlinear effects that almost certainly cannot be modeled in any detail by a simple trend plus noise. These underlying process structures appear to have not been seriously investigated in the paleoclimate temperature reconstruction literature. Cohn and Lins (2005) make the case that much of natural time series, in their case hydrological time series, might be modeled more accurately by a long memory process. Long memory processes are stationary processes, but the corresponding time series often make extended sojourns away from the stationary mean value and, hence, mimic trends such as the perceived hockey stick phenomena.

They then go on to argue for exactly the kind of models we have been describing in this blog.

## Options for ACF in R

The autocorrelation function is one of the Niche model basics in the R language. The ACF is the real deal when it comes to diagnosing the internal character of series data, as was done for scale invariance in Scale Invariance for Dummies. Questions about the internal character of series are very topical, as in “How Red are My Proxies”, where it was claimed that global temperatures are barely AR(1), contrary to the views of a colleague discussed in “A Mistake with Repercussions”.

The ACF function in R contains another useful diagnostic option called the partial autocorrelation. Partial autocorrelation is estimated by fitting autoregressive models of successively higher orders up to lag.max. Partial autocorrelations are useful in identifying the order of an autoregressive model. The partial autocorrelation of an AR(p) process is zero at lag p+1 and greater.
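The cut-off property of the partial autocorrelation is easy to see on simulated data. The sketch below is illustrative only: the series length, the random seed and the AR coefficient of 0.7 are arbitrary assumptions, and the series are simulated rather than the temperature data discussed in this post.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 2000

iid <- rnorm(n)                                   # white noise
ar1 <- arima.sim(model = list(ar = 0.7), n = n)   # AR(1), coefficient assumed

# Partial autocorrelations up to lag 10, computed without plotting
p_iid <- pacf(iid, lag.max = 10, plot = FALSE)$acf[, 1, 1]
p_ar1 <- pacf(ar1, lag.max = 10, plot = FALSE)$acf[, 1, 1]

# Approximate 95% bound for a partial autocorrelation that is truly zero
bound <- 1.96 / sqrt(n)

# One row per series, one column per lag
print(round(rbind(iid = p_iid, ar1 = p_ar1), 2))
cat("Approximate 95% bound:", round(bound, 3), "\n")
```

For the AR(1) series only the first lag should stand well clear of the bound, while the white noise should show nothing systematic at any lag. Calling pacf() with its default plot = TRUE draws the same values with the bound marked.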
I thought I would have a look at some data to see what the partial ACF has to say about the possible AR order of temperature series.

Figure 1 shows the partial correlation coefficients of the simple series, from top to bottom: IID (random independent and identically distributed), MA (moving average of IID), AR (autoregression), and SSS (fractionally differenced FARIMA model). The partial correlations of the IID and the AR(1) series are zero after lag 1. The SSS is not zero until greater than lag 5.

Figure 2 shows the partial ACF, from top to bottom, of the natural series CRU (global average temperatures), MBH99 (Mann’s hockey stick), and spatial precipitation and temperature. The CRU series also has significant correlations at lag 5, as does the MBH99 reconstruction of temperatures back to 1000 AD. The spatial temperature and precipitation series also have slight partial correlations above lag 1.

## Global Distribution Models of Species

To develop global distribution models of species you need global data. A major component of the datamining methodology in WhyWhere is access to a large number of global datasets. At last count over 1000 layers were available for download, testing for correlates and building models. These are organized as lists. The following are the lists, with the number of global variables in each list in brackets. Clicking on the link lists the files in the list, with links to the metadata, a thumbnail image of the data, and the data itself in pgm (portable gray map) image form.

## PDFs on Biodiversity Modeling

Biodiversity modeling is about multiple species and their relationships, in contrast to individual species and their environment. This leads to slightly different approaches and problems. My selection of the documents of historic influence, the programs in the field, and some of the more important recent papers follows.
## Reports

## Standard Statistical Packages

If you have an interest in statistics packages, for finance, investment, or science, you soon find out how expensive they are. I thought I would do a short survey of the standard statistical packages, and see how the free ones compare.

MiniTab: “The World Trusts Minitab for Quality”. A 30-day free demo is available. The market for Minitab is corporate quality improvement. It appears to have a nice graphics and user interface and a simple macro language. The cost is $1195. The product seems relatively simple compared with the other statistics packages.

SAS: “The power to know”. SAS is very strong on integrated business analytics and prediction, with a comprehensive array of hundreds of products for different industries and applications. For example, its integrated business systems include a database and visualization. There was no published price information.

SPSS: “Enabling the predictive enterprise”. SPSS is pitching predictive enterprise concepts aimed at increasing return on investment in the corporate environment. They have many different products across technologies and industries (e.g. risk management, fraud, data mining, healthcare), with each package costing about $400-$500 (classification and regression trees, for example, is one package).

S-plus: “Delivering the Knowledge to Act”. This is a package with a language based on S, oriented towards industries such as life sciences, telecommunications and manufacturing (and not so oriented towards corporate finance). No price was provided.

Matlab: “Accelerating the pace of engineering and science”. Matlab is strong on numeric simulation in engineering fields and, with its Matsim package, can integrate many complex applications in statistics and signal processing. The web site claims a 15-day trial is available for most products, but only if you have an existing product license. The cost depends on academic or other pricing, but expect a few hundred dollars for the basic package, and additional hundreds for each extra module.

Maple 10 is a little different from the other statistics packages. As an algebra system, it can manipulate equations, substitute, simplify, manage, and help publish the most complex mathematics. It also works with Matlab.

## Predicting Numbers

Niche modeling is about using probability distributions to predict. From the obscure journal article department comes a strategy for winning at Lotto using niche modeling.

Schemes have often been proposed to predict the best numbers for lotteries. One of the most common uses the fact that some numbers are bet more than others, thus betting the unpopular numbers increases returns. A paper entitled Estimating the Frequency Distribution of the Numbers Bet on the California Lottery by Mark Finkelstein, November 15, 1993, presented evidence supporting this hypothesis.

As the distribution of numbers bet in the California Lotto is not public information, the distribution was estimated by a complex analysis from the public data on winning numbers. The results agreed with the Canadian Lotto where the distribution of numbers is made public. Based on the winning numbers for 176 games, they estimated the non-uniformity of the probability of a number being in the winning 6 numbers as below.

Estimate of the probability of numbers from 1 to 51 being in the set of 6 numbers bet.
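For reference, the uniform baseline against which such an estimate is compared is easy to compute. The R sketch below also simulates a population of bettors who favour low numbers, to show how "unpopular" numbers would be picked out. The preference weights (numbers up to 31 chosen twice as often, as if from birthdays) and the 10,000 simulated tickets are pure assumptions and are not the estimates from the paper.

```r
n_numbers <- 51   # numbers on the ticket
k <- 6            # numbers per selection

# Probability that a given number appears in a uniformly random selection
p_uniform <- k / n_numbers

# Number of distinct selections of 6 from 51
n_combinations <- choose(n_numbers, k)

# Simulated tickets from bettors who favour low numbers (weights assumed)
set.seed(7)       # arbitrary seed
weights <- ifelse(1:n_numbers <= 31, 2, 1)
picks <- replicate(10000, sample(n_numbers, k, prob = weights))
freq <- tabulate(picks, nbins = n_numbers) / 10000

# Numbers bet less often than the uniform rate are the "unpopular" ones
unpopular <- which(freq < p_uniform)

cat("Uniform probability:", round(p_uniform, 4), "\n")
cat("Distinct selections:", n_combinations, "\n")
cat("Unpopular numbers:", unpopular, "\n")
```

Betting unpopular numbers does not change the chance of winning, which stays at one in choose(51, 6) per ticket. The argument is only that a prize on those numbers would be shared with fewer other winners.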

## Intelligent Thought by Brockman. Review

This book is a compilation of arguments against Intelligent Design (ID), from sixteen of the world’s leading evolutionary biologists. It includes the full text of the Harrisburg, PA, Dec. 20 judgement where a federal judge ruled against teaching ID in schools. The book will reinforce the arguments of the judge and perhaps stop ID from gaining a foothold in schools in other states. An entire school board advocating teaching a supernatural explanation for natural events reminds one of why it’s necessary to communicate not only the results but the methods of science and improve numeracy in elementary school children.

The judge was of the opinion that while the proponents of ID held deep beliefs and ID should continue to be discussed, it was unconstitutional to teach it in classrooms as an alternative to evolution, as ID was merely ‘an interesting theological argument’. The strongest statement was that natural science does not admit supernatural explanations; to do so is not science.

Abolishing ID by definition from science would set ID proponents back but hardly placate them, as they have been trying to promote an experimental research program. The problem is, and this is the value of the book by Brockman, there is no experimental evidence for ID, and an overwhelming amount of evidence for evolution. As Stephen Hawking recently said:

SH: There is no evidence for intelligent design. The laws of physics and chemistry, and Darwinian evolution, are sufficient to account for everything in the universe.

## Niche Modeling. Chapter Summary

Here is a summary of the chapters in my upcoming book Niche Modeling to be published by CRC Press. Many of the topics have been introduced as posts on the blog. My deepest thanks to everyone who has commented and so helped in the refinement of ideas, and particularly in providing motivation and focus.

Writing a book is a huge task, much of it a slog, and it’s not over yet. But I hope to get it to the publishers so it will be available at the end of this year. Here is the dustjacket blurb:

Through theory, applications, and examples of inferences, this book shows how to conduct and evaluate ecological niche modeling (ENM) projects in any area of application. It features a series of theoretical and practical exercises in developing and evaluating ecological niche models using a range of software supplied on an accompanying CD. These cover geographic information systems, multivariate modeling, artificial intelligence methods, data handling, and information infrastructure. The author then features applications of predictive modeling methods with reference to valid inference from assumptions. This is a seminal reference for ecologists as well as a superb hands-on text for students.

## Part 1: Informatics

Functions: This chapter summarizes the major types, operations and relationships encountered in the book and in niche modeling. This and the following two chapters could be treated as a tutorial in R. For example, the main functions for representing the inverted ‘U’ shape characteristic of a niche (step, Gaussian, quadratic and ramp functions) are illustrated in both graphical form and R code. The chapter concludes with the ACF and lag plots, in one or two dimensions.
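To give a flavour of these response functions, here is a minimal sketch in R. The definitions, function names and the centre and width parameters are my own illustrative choices for this post and are not copied from the book. In particular, the ramp is written here as a symmetric pair of linear ramps.

```r
# Four simple response curves with the inverted 'U' shape of a niche.
# x is an environmental variable; centre and width are illustrative parameters.
step_fn <- function(x, centre = 0, width = 1) {
  as.numeric(abs(x - centre) <= width)          # 1 inside the range, 0 outside
}
gaussian_fn <- function(x, centre = 0, width = 1) {
  exp(-((x - centre)^2) / (2 * width^2))        # bell curve with peak 1
}
quadratic_fn <- function(x, centre = 0, width = 1) {
  pmax(0, 1 - ((x - centre) / width)^2)         # parabola truncated at 0
}
ramp_fn <- function(x, centre = 0, width = 1) {
  pmax(0, 1 - abs(x - centre) / width)          # linear rise and fall
}

# Evaluate all four on a grid and plot them together
x <- seq(-3, 3, by = 0.1)
curves <- cbind(step = step_fn(x), gaussian = gaussian_fn(x),
                quadratic = quadratic_fn(x), ramp = ramp_fn(x))
matplot(x, curves, type = "l", lty = 1:4, col = 1,
        xlab = "Environmental variable", ylab = "Response")
legend("topright", legend = colnames(curves), lty = 1:4)
```

All four peak at the centre and fall away on either side; they differ in how sharply the response drops at the edge of the niche.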

Data: This chapter demonstrates how to manage simple biodiversity databases using R. By using data frames as tables, it is possible to replicate the basic spreadsheet and relational database operations with R’s powerful indexing functions. While a database is necessary for large-scale data management, R can eliminate conversion problems as data is moved between systems.
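A short sketch of the idea, using two tiny made-up tables: the species names other than the Greater Glider, the site codes and the counts are all invented for illustration and are not from the book.

```r
# Two small tables with made-up records
species <- data.frame(
  sp_id = c(1, 2, 3),
  name  = c("Petauroides volans", "Hypothetical species B",
            "Hypothetical species C")
)
sightings <- data.frame(
  sp_id = c(1, 1, 2, 3, 3, 3),
  site  = c("A", "B", "A", "B", "C", "C"),
  count = c(4, 2, 1, 7, 3, 5)
)

# Spreadsheet-style filter: rows where more than two animals were counted
busy <- sightings[sightings$count > 2, ]
print(busy)

# Relational join on the shared key
joined <- merge(sightings, species, by = "sp_id")

# Group-by summary: total count per species
totals <- aggregate(count ~ name, data = joined, FUN = sum)
print(totals)
```

Logical indexing plays the role of a WHERE clause, merge() that of a JOIN, and aggregate() that of GROUP BY.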

Spatial: R and image processing operations can perform many of the elementary spatial operations necessary for niche modeling. While these do not replace a GIS, the chapter demonstrates that generalizing arithmetic concepts to images can implement simple spatial operations efficiently.
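As a sketch of what arithmetic on images means in practice, the R code below treats two rasters as plain matrices and combines them cell by cell. The 20 x 20 layers are filled with random numbers, and the thresholds and equal weights are arbitrary assumptions chosen only to show the mechanics.

```r
# Two hypothetical 20 x 20 rasters held as plain matrices
set.seed(42)   # arbitrary seed
slope   <- matrix(runif(400, 0, 30), nrow = 20)                  # degrees
quality <- matrix(sample(1:4, 400, replace = TRUE), nrow = 20)   # classes 1-4

# Overlay by element-wise logic: gentle slope AND high site quality
suitable <- (slope < 15) & (quality >= 3)

# Map algebra: a weighted combination of the two layers, scaled to 0..1
score <- 0.5 * (1 - slope / 30) + 0.5 * (quality / 4)

# Proportion of cells that are suitable, and a quick look at the score layer
cat("Proportion suitable:", mean(suitable), "\n")
image(score, main = "Combined score (illustrative)")
```

Because the operations are vectorized over the whole matrix, no explicit loop over cells is needed, which is where the efficiency comes from.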

## Part 2: Modeling

Theory: Set theory helps to identify the basic assumptions
underlying niche modeling, and the relationships and constraints between these
assumptions. The chapter shows the standard definition of the niche as
environmental envelopes is equivalent to a box topology. It is proven that when
extended to infinite dimensions of environmental variables this definition
loses the property of continuity between environmental and geographic spaces.
Using the product topology for niches would retain this property.
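In practical terms, an environmental envelope is a box: an interval on each environmental variable, with a site counted as inside the niche only if it falls within every interval. A minimal sketch in R follows, with made-up temperature and rainfall values and a hypothetical helper function in_envelope(); it illustrates the definition only, not the topological proof in the chapter.

```r
# Environmental values at sites where a species was recorded (made-up numbers)
presence <- data.frame(temp = c(12, 15, 14, 18, 16),
                       rain = c(800, 950, 1100, 900, 1000))

# The envelope is the box spanned by the observed range of each variable
lower <- sapply(presence, min)
upper <- sapply(presence, max)

# A site is inside the niche only if every variable falls within its range
in_envelope <- function(site) {
  all(site >= lower & site <= upper)
}

in_envelope(c(temp = 15, rain = 1000))   # within range on both axes
in_envelope(c(temp = 15, rain = 1300))   # temperature fits, rainfall does not
```

Each added variable adds another interval that must be satisfied, which is the sense in which the envelope is a product of one-dimensional ranges.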

## Statistical Predictive Models in Ecology: Comparison of Performances and Assessment of Applicability

Authors: Can Ozan Tan, Uygar Ozesmi, Meryem Beklioglu, Esra Per, Bahtiyar Kurt