GARP Modelling System User's Guide and Technical Reference

by Karen Payne and D.R.B. Stockwell

Introduction

Welcome to the GARP Modelling System (GMS)! GARP is an acronym for
Genetic Algorithm for Rule Set Production. The GMS is a set of modules
primarily designed for predicting the potential distribution of biological
entities from raster based environmental and biological data. The modules
perform a variety of analytical functions in an automated way, thus making
possible rapid unsupervised production of animal and plant distributions.
This manual describes the use of the software which has widespread application
where a simple to use, robust and informative modelling system is needed.

The package you have just downloaded consists of three parts. First,
this manual which is intended to serve as a gentle introduction to those
interested in using the GMS. Secondly, you have all the programs necessary
for using the GMS. Finally, this package also contains a small example
data set and scripts for running them, do.x and do2.x. These examples are
referred to in this tutorial paper.

  • copyright
  • caveat
  • revision history
  • retrieving and installing garp
  • contacting the author
  • other sources of information
  • conventions used in this manual
  • a note on the examples provided
  • general structure of analytical systems
  • general structure of this manual
  • parameters
  • rasteriz
  • presampl
  • image

Administrative matters

Copyright

This program is the copyrighted intellectual property of David Stockwell. Permission is given to use this program for evaluation. For regular
use a fee may be charged.

Caveat on the use of the GARP modelling package

The author disclaims any warranties of fitness of programs for any particular
problem.

Revision history

This is version 1.0, the first public release of GARP.

Downloading and installing the GMS

GARP is available for download from biodi.sdsc.edu

Installation

This package is a C coded version of an earlier system called Ttree
which was written in Turbo PROLOG for MSDOS. This version has run successfully
on Sun workstations and IBM PCs (using Linux).

Installation in UNIX

The following commands should compile the programs. Modify the makefile
for system specific compilers and installation destination.

Unzip, untar and compile.

  > gunzip garp-1.0.tar.gz
  > tar xovf garp-1.0.tar
  > make all

Installation in DOS

For DOS installation use the syntax:

  >pkunzip -d garp_1.0.zip

in your garp directory.

After you install garp you should have the following files on your
system:

  • executables
  • rasteriz*
    initial*
    presampl*
    explain*
    predict*
    verify*
    image*
    translat*
    
  • documents
  • CAVEATS
    COPYRIGHT
    FAQ
    README
    garp.txt
    formats.txt
    rasterize.txt
    initial.txt
    presample.txt
    explain.txt
    predict.txt
    verify.txt
    image.txt
    translate.txt
    index.txt
    
  • batch files
  • mod.x*
    multi.x*
    
  • data
  • Example/
    Example2/
    

Contacting the author

Bugs, comments, money, contracts and general praise of the GMS can be
directed to the author via one of the following contacts:

David Stockwell
davids99us at yahoo.com

What published information and applications are available?

Biodi also has supporting documentation, in addition to links to other
relevant sites, and can be viewed at http://biodi.sdsc.edu.

You may also wish to join the GARP public mailing list by sending the
message:

> subscribe garp 

to

majordomo@sdsc.edu

The list owner is David Stockwell, who can be contacted at the following
email address: davids@sdsc.edu

If you are interested in learning more about genetic algorithms you may
view the Genetic Algorithms FAQ from the newsgroup comp.ai.genetic.

A note about the notation used in this manual

File names are given in quotation marks (e.g. "test"). Commands
that you would type in at your terminal are preceded with a > indicating
the machine prompt. cd means change directory.

A note on the examples provided

Example programs are included for testing and tutorial purposes. These are contained
in the directories Example and Example2. To run the GMS on these examples
on UNIX or DOS machines cd to the "Example" directory and type:

  > do

This example uses most of the model development tools in their typical
usage. The output is an ascii map of a 20×20 distribution of Greater Glider
density. To run the second example cd into "Example2" and type:

  > do2

This example applies the rules developed in "do" to a 140×100
map to predict the density of Greater Glider over a larger area. The outputs
are images in portable grey map (pgm) format. You will need an image viewer
such as xv to view or convert these images.

It may be useful to examine the files "do", "do2"
and "parameters" as an example of how to run GARP in a batch
mode. A typical batch file for a UNIX machine, the "do" file,
is shown below:

set -x
echo "Datadir Example" > paramete
cp Example/layer00 .
./presampl -prop
./initial
./explain
cat test | ./verify
./predict | ./image -pnm
./translat
cat predI.pgm

Each line of the batch file is a separate command to the operating system
which is executed in order in the batch file. The pipe symbol (|) directs
the output of one program into the input of another. The redirect symbol
(>) directs the output of one program into a named file.

General structure of this manual

This manual is designed to guide you through the GARP Modelling System.
The GMS contains tools for database modelling and visualisation. Provided
the requisite files are available each module can be run independently
at any given time. The intent is to describe how the GMS runs and explain
how and why this modelling system performs the way it does.

The GMS modules and typical order of application are:

  • rasteriz – converts point data to contiguous raster data layers
  • presampl – controlled sampling of data to provide data sets of uniform
    size, prior probability, and spatial coverage. It produces training and
    test sets for the GARP modelling system
  • initial – develops an initial model
  • explain – applies the genetic algorithm to improve the initial
    models and outputs a set of the best models
  • verify – evaluates the models produced in explain on an independent
    test data set
  • predict – takes the model and produces a prediction for each value
    and cell of the raster data set
  • image – converts the predicted images into standard image formats
  • translat – provides natural language translation of the models

While the main form of this analysis is statistical, it has a number
of outstanding features:

  • the tools are low impact and fast. This allows interactive applications
    to be developed.
  • the data formats are open and can be integrated with scripts. This
    allows applications to be developed where analysis bridges the gap between
    databases and graphic visualisation packages.
  • produces a set of models including expert system rules, BIOCLIM and
    logistic regression. These are evaluated and applied at the same time allowing
    simultaneous comparison of different methods.
  • induces a wide range of model types from data, optimises them for accuracy
    and significance, and then ranks them according to quality. This provides a
    robust system which attempts to find the best patterns in the data.
  • The package has so far been used to develop predictive models
    of the distribution of biological species from survey data, although many
    other applications are possible.

    General structure of the analytical system

    The GMS has a "production line" type architecture, with a linear
    configuration of components which provides an efficient, simple structure.

    Within the range of alternative architectures for spatial information
    systems the GMS can be classified as a loosely coupled system (Abel et al.
    1992). Loosely coupled or open systems support re-configuration and
    therefore ease of integration and customisation. For example, recorded
    sightings can be extracted from an ORACLE database and the results displayed
    in browsers such as NETSCAPE.

    The Database

    Data formats and preparation

    There are three stages to data preparation. You begin with geocoded
    flat files, turn these into a series of "layer" files and then
    sample the layers to create training and testing datasets.

    The challenge in modelling biological pattern is to take a set of site-based
    records of a species and produce an accurate map of its potential
    distribution. The records are scattered unevenly throughout
    the region and points of absence may or may not be recorded (Fig 1).

    Figure 1: An example of biological survey data. Points where a species
    occurs are shown in white and points where it doesn’t occur in black.

    The "data" file

    The data that you wish to use in modelling must be in a "data"
    file called a point coverage. The point coverage is an ascii file, the
    first two columns contain the geocode (longitude and the latitude or easting
    and northing), and the following columns contain an abundance value for
    a species, or value of a variable, eg:

    150.775 -35.005 0 0 1178 195 169 0
    148.005 -35.005 1 3 824 204 138 5
    ...
    

    These values can originate from any source; most database and GIS applications
    can output points in this form. This format is also known as a point coverage
    in ARCINFO or as a geocoded flat file in other parlance.

    The "parameters" file

    You must also create a file called "parameters" in your working
    directory. This file serves two functions. First, it stores information
    about each of the variables you use in your experiment. Secondly, it contains
    parameters for controlling the options available to the programs in the
    GMS. The listing below shows an example of the minimal contents of a parameter
    file for two independent variables.

    Columns 0 20 2
    Rows 0 20 2
    Variable 0 ExM 0 3 c degC %2.0f
    Variable 1 Dev 0 2 c % %2.0f
    Variable 2 StC 0 1 c mm %2.0f
    

    The meaning of the "parameters" file is:

    Columns (x min) (x max) (increment)
    Rows (y min) (y max) (increment)
    Variable (column) (name) (min) (max) (type) (units) (format)
    Variable ...
    

    The Columns and Rows parameters control the spatial information for
    mapping of point coverages into layers. The first number is the minimum
    spatial extent, the second is the maximum spatial extent, and the last is
    the cell size. The size of the layer is determined from the equation:

    (max - min)/size
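
    As a worked illustration (this short C fragment is not part of the GMS
    source), the number of columns and rows implied by the example Columns
    and Rows lines above can be computed directly from this relation:

      #include <stdio.h>

      int main(void)
      {
          /* Columns 0 20 2 and Rows 0 20 2, from the example parameters file */
          double xmin = 0, xmax = 20, xsize = 2;
          double ymin = 0, ymax = 20, ysize = 2;

          /* number of cells along each axis = (max - min) / cell size */
          int ncols = (int)((xmax - xmin) / xsize);   /* 10 columns */
          int nrows = (int)((ymax - ymin) / ysize);   /* 10 rows    */

          printf("No of columns %d, no of rows %d.\n", ncols, nrows);
          return 0;
      }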
    

    Together the first values of the rows and columns parameters should
    define the upper left hand corner of the image. This means that the order
    of the geocodes in the "parameters" file depends on the reference
    grid that you are using and correspondingly which hemisphere you are working
    in. For example, if you are using easting and northing measures characteristic
    of UTM projections in the southern hemisphere then the order of the geocodes
    will look something like:

    Columns 250000 265750 30
    Rows 6069265 6053515 30
    
    

    Here the values of the eastings (or columns) increase as you move east.
    The northings (rows) decrease as you move away from the equator in the
    southern hemisphere. In contrast, if you were working with latitude and
    longitude measures then the first value of the rows parameter would be
    smaller than the second value.

    The remaining parameters are details of the model variables: variable
    number, name (short identifier), minimum value, maximum value, types (categorical,
    ordered or continuous), units of measure, and printf printing format. Note
    that because the first two fields of your "data" file are reserved
    for their geocodes, Variable 0 corresponds to the third column of your
    "data" file, Variable 1 corresponds to the fourth column of your
    "data" file etc. The printf format determines the how the variables
    will be printed in the final model, that is, how many digits will be represented,
    and follows the conventions of C programming.
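
    For readers unfamiliar with C print formats, the small illustrative
    fragment below (not part of the GMS) shows how the %2.0f and %2.2f
    formats used in the examples in this manual print a value:

      #include <stdio.h>

      int main(void)
      {
          double elevation = 281.51;
          printf("%2.0f\n", elevation);   /* prints "282": no decimal places */
          printf("%2.2f\n", elevation);   /* prints "281.51": two decimals   */
          return 0;
      }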

    A parameters file described above typically resides in a directory
    containing all the data. It is possible to have another "parameters"
    file in a working directory with the line:

    Datadir [absolute path name]
    

    The Datadir entry points the applications to another directory where
    the full parameter file (as described above) is located. This feature allows
    data to be located in a single location away from temporary directories where
    the program may be running. To reiterate, this type of "parameters"
    file can be put in a separate directory as long as another file called
    "parameters" is present in the working directory and contains
    a line such as:

    Datadir /usr/Data/Australia
    

    where /usr/Data/Australia is the directory where the "data"
    and the "parameters" file containing the definitions of the variables
    is kept. Alternatively, you could use the command line option

    -data /usr/Data/Australia
    

    when running a module to specify where the data is kept.

    The parameters file in the working directory can contain other lines
    and flags affecting the running of the program. For example, one special
    parameter is the Variables list which specifies the variables to use in
    the analysis: e.g.

    Variables 1,2,5,6
    

    This parameter will cause only variables 1,2,5 and 6 to be used. A full
    explanation of the options available to the GMS that are controlled by flags
    in the "parameters" file is given in the man pages at the end
    of this manual.

    rasteriz

    The next step in data preparation uses the program "rasteriz"
    to convert your "data" file into a series of binary image files,
    called "layers" which have one byte value per grid cell. Typically
    all variables used in the modelling are layers. This format has a number
    of advantages, the first being compression of information. For example,
    a typical grid of 258×410 contains 106K points, requiring significant memory
    resources if stored as floating point numbers. Storing these layers with
    one byte per cell reduces the amount of memory needed. In practice the
    approximation has not been a limitation.

    The program "rasteriz" maps point data into a byte valued
    spatial grid at a given scale. A cell is a single byte, its value determined
    by linearly scaling the point value between 1 and 254. Suppose, for example,
    that you had a data file that recorded the absence (0) or presence (1)
    of a species at a series of locations. After running rasteriz over this
    data the output layer (called "layer00") will have two values: 1 for
    records where the species is absent and 254 for records where it is present.
    As mentioned, byte values are represented efficiently on computers,
    contributing to computational efficiency. A further advantage is that
    normalising and scaling the variables into single bytes reduces the effects
    of differing magnitudes between variables, which can affect some analytical
    techniques.
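
    A minimal sketch of the linear scaling described above (illustrative
    only, not the actual rasteriz source) maps a point value into the byte
    range 1 to 254 using the variable's min and max from the "parameters"
    file, with 0 left for empty cells as in the species case:

      #include <stdio.h>

      /* Scale a point value into the byte range 1..254 given the variable's
         min and max as listed in the "parameters" file.                      */
      static unsigned char scale_to_byte(double value, double min, double max)
      {
          double t = (value - min) / (max - min);   /* 0..1 */
          return (unsigned char)(1.0 + t * 253.0);  /* 1..254 */
      }

      int main(void)
      {
          /* presence/absence variable with min 0 and max 1, as in the example */
          printf("%d\n", scale_to_byte(0.0, 0.0, 1.0));   /* absence  -> 1   */
          printf("%d\n", scale_to_byte(1.0, 0.0, 1.0));   /* presence -> 254 */
          return 0;
      }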

    Mapping the data to scaled byte values also has the effect of changing
    all variables to a common type. Rasterize recognises three types of variables,
    recorded in the type field of the "parameters" file.
    Each type is treated differently, as sketched in the example after this list:

    • for species or presence/absence data, which are denoted with an "s",
      a cell takes a presence value if one or more points falls within it, otherwise
      it remains zero.
    • for categorical data denoted with a "c", a cell takes the
      value of the mode of the values of the points that fall within it.
    • continuous or real data are marked with an "r" and the cell
      takes the mean value.
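
    As a rough sketch of these three treatments (illustrative only, not the
    rasteriz source; the helper name cell_value is invented here), the value
    written to a cell could be computed from the point values that fall within
    it as follows:

      #include <stdio.h>

      /* Combine the point values falling in one grid cell according to the
         variable type letter from the parameters file:
         's' presence/absence, 'c' categorical (mode), 'r' real (mean).      */
      static double cell_value(char type, const double *pts, int n)
      {
          int i, j;
          if (n == 0) return 0.0;              /* empty cell stays zero        */
          if (type == 's')                     /* any point at all -> presence */
              return 1.0;
          if (type == 'c') {                   /* mode of the category values  */
              double best = pts[0];
              int bestcount = 0;
              for (i = 0; i < n; i++) {
                  int count = 0;
                  for (j = 0; j < n; j++)
                      if (pts[j] == pts[i]) count++;
                  if (count > bestcount) { bestcount = count; best = pts[i]; }
              }
              return best;
          }
          /* 'r': mean of the values */
          {
              double sum = 0.0;
              for (i = 0; i < n; i++) sum += pts[i];
              return sum / n;
          }
      }

      int main(void)
      {
          double geology[] = { 6, 6, 2 };       /* three points in one cell */
          printf("categorical cell value: %.0f\n", cell_value('c', geology, 3));
          return 0;
      }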

    The mapping also has the effect of bringing all of the data to the same
    scale. Spatial auto-correlations caused by localised intensive sampling
    or duplicates are eliminated. The magnitude of the effect can be seen in
    the decrease in the number of effective data points. In an example of the
    output from rasterize below, 58 data points are read but only 31 points
    are recorded in the data grid:

    RASTERIZE - point coverage to gray layer
    No of data points read 58
    Raster cell size is  0.17x0.17  degrees
    Presences 31 Absences 0
    

    In setting up an application the rasteriz program is used to prepare
    the environmental layers for subsequent analysis. This has already been
    done in the example in the distribution package, the layers contained in
    the directory Example.

    Figure 2: Examples of independent environmental variables in binary
    layer form. The values range from the minimium value in black to the highest
    value in white. The layers shown above from top left are geology, annual
    temperature, annual rainfall and latitude.

    A typical implementation uses over 30 environmental (predictor) data
    layers, each containing variables such as temperature, rainfall, geology,
    and topography. These layers remain constant; they are named layer01
    to layer30 and are the independent variables for model development.

    Creating "layer" files from the data can be done an number
    of ways.

    1. creating a single "layer" file one at a time
    2. creating a series of "layer" files from all variables in
      a point coverage
    3. creating a "layer" file from an arc grid file

    Each of these three procedures is detailed below.

    1. creating a single "layer" file one at a time

    The object of this modelling system is to take this point data which
    is referred to as the dependent variable, and create a model which relates
    it to your suite of independent variables.

    Say for example that you wanted to predict a tree species based on a
    set of field observations. You would have a geocoded data set of observations
    of the dependent variable which cover only some of the rasters in your
    dataset. Additionally you will need a set of predictor variables.

    The general, minimal form of the "parameters" file looks
    like:

    Columns (x min) (x max) (increment)
    Rows (y min) (y max) (increment)
    Variable (column) (name) (min) (max) (type) (units) (format)
    Variable ...
    

    And in this instance the "data" file of observational data
    may look something like:

    250013 6053501
    250605 6053720
    250447 6067253
    etc.
    

    where each location indicates a known presence (absence is not recorded).
    Alternatively, the "data" file may be of the form:

    250013 6053501 0
    250605 6053720 1
    250447 6067253 0
    etc.
    

    where a 0 indicates a known absence at a site and 1 indicates a known
    presence.

    The "parameters" file which describes this set of dependent
    or observational data may look something like:

    Columns 250000 265750 30
    Rows 6069265 6053515 30
    Variable 0 Eumac 0 1 s species %2.0f
    

    Where "Eumac" indicates that the first variable (located in
    the third column) in the "data" file is the variable in question,
    it is a species variable and may have values 0 or 1. Other types of variables
    are "c" for categorical variables and "r" for continusous
    variables. This allows you to model species presence or absence or abundance
    or different types of variables. The location of "species" in
    the parameter file is reserved for the unit of measure.

    In this case the "rasteriz" program is run using the following
    syntax:

    > cat data | rasteriz 

    or

    > rasteriz -file data
    

    Each time you run "rasteriz" on a single variable you generate
    an output file layer called "layer00." If "layer00"
    already exists in your current working directory and you run "rasteriz"
    then "layer00" will be overwritten." "layer00"
    is always the dependent variable being modelled. In this example "layer00"
    is the layer representing the presence or absence of the Eumac species.

    2. creating a series of "layer" files from all variables
    in a point coverage

    The "data" file for the independent variables may look something
    like:

    250000 6069250 26.20 6.5862 6
    250030 6069250 24.52 6.1418 6
    250060 6069250 22.91 6.2675 6
    etc
    

    In this case the first two columns are again the geocodes. The third
    column is elevation, a real variable ranging from -1 (no data) to 281.51
    meters. The fourth column is slope, another real variable ranging from
    -1 (no data) to 209.3667 and measured in percent. The final predictor variable
    is geology type, a categorical variable that may take integer values from
    0 to 6.

    The "parameters" file associated with the independent data
    may look something like:

    Columns 250000 265750 30
    Rows 6069265 6053515 30
    Variable 0 ele -1 281.51 r m %2.2f
    Variable 1 slo -1 209.3667 r perc %2.4f
    Variable 2 geo 0 6 c type %2.0f
    

    To rasterize all of the variables in this point coverage at once use
    the syntax:

    > rasteriz -file data -all
    

    This will create a series of layer files: "layer00," "layer01,"…
    "layern". There is one layer file for each of your predictor
    variables. In the next section you will create models using the dependent
    and independent variables. When you do this the dependent variable being
    modelled will always be "layer00." For this reason if you choose
    to rasterize all of your independent data at once you should rename each
    of the layer files so that "layer00" becomes "layer01",
    "layer01" becomes "layer02" etc. This allows "layer00"
    created in the previous step to be the dependent variable in your model.
    Additionally, you will also want to modify the "parameters" file
    so that it reflects the new layer names. In our example the new "parameters"
    file will look like:

    Columns 250000 265750 30
    Rows 6069265 6053515 30
    Variable 0 Eumac 0 1 s species %2.0f
    Variable 1 ele -1 281.51 r m %2.2f
    Variable 2 slo -1 209.3667 r perc %2.4f
    Variable 3 geo 0 6 c type %2.0f
    

    3. creating a "layer" file from an arc grid file

    An ARCgrid file has a header describing the size of the file and its
    geographic location, followed by the data points.

    ncols   
    nrows   
    xllcorner       
    yllcorner       
    cellsize        
    NODATA_value    
    0 0 0 0 0 0 0 1.2 0 5 6 6 6 7.2 0 ...
    

    In the case where the max and min values of the ARCgrid file are not
    known, first run option -layer 2 to find them. Dummy values can
    be used for the max and min in the parameters file.

    > cat data | rasteriz -layer 2
    Missing val
    Min is 0, max is 54.0
    Columns 0 20 1
    Rows 0 20 1
    No of columns 20, no of rows 20.
    

    The output lists the max and min values, and appropriate lines for the
    Rows, Columns and Missingval entries in the parameters file. After amending
    the parameters file to reflect these values the ARCgrid file can be input
    using option -layer 3.

    >cat data | rasteriz -layer 3
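
    For illustration only, the following C fragment performs the same kind of
    pass that the -layer 2 option does: it skips the six header lines of an
    ARCgrid file and scans the data values for their minimum and maximum (the
    input file name "data" is assumed, and NODATA values are not excluded in
    this simplified version):

      #include <stdio.h>
      #include <float.h>

      int main(void)
      {
          FILE *fp = fopen("data", "r");        /* ARCgrid file, name assumed */
          char key[64];
          double val, min = DBL_MAX, max = -DBL_MAX;
          int i;

          if (fp == NULL) return 1;
          for (i = 0; i < 6; i++)               /* skip ncols, nrows, xllcorner,  */
              if (fscanf(fp, "%63s %lf", key, &val) != 2)  /* yllcorner, cellsize, */
                  return 1;                     /* NODATA_value header lines       */
          while (fscanf(fp, "%lf", &val) == 1) {
              if (val < min) min = val;
              if (val > max) max = val;
          }
          printf("Min is %g, max is %g\n", min, max);
          fclose(fp);
          return 0;
      }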
    

    presample

    The final step in data preparation is to take the "layer*"
    files and create a training set of data with which we can induce a model
    and an independent test set which allows us to assess the model’s performance,
    ie. we can count how many mistakes the induced model makes on the "test"
    file. The presample program is run using:

    >presampl
    

    Note there is no "e" at the end of this command.

    Running the program presample will produce two point files from the
    stack of layers developed by the rasterize program. These files are called
    "train" and "test". The "train" file is used
    for training the genetic algorithm and the "test" file is used
    for testing its overall accuracy. As presampl is running it will print
    the following diagnostics to the screen:

    PRESAMPLE - from data to train and test sets
     201386 points with pred 0
     85 points with pred 1
     119 points with pred 254
    presences 119 absences 85 background 201386 no_of_data 201590
    102 points to train and 102 points to test
    

    The "test" and "train" files produced from presampl
    look like:

     0 0   85 254   1 102 169  85   1 254
     0 0  254 254   1 203 169 254   1 254
     0 0  254 127   1 152  85  85   1 254
     0 0  169 127   1 152 169 169   1 254
     0 0    1 124   1   1   1   1   1 254
     0 0  169 254   1 254 254 169   1 254
     0 0  254 254 254 203 254 169   1 254
     0 0    1 254   1   1 254 169   1 254
     0 0  254 254   1 152  85 254   1 254
     0 0   85 254   1 203 169 169   1 254
     ...
    

    The columns are the x and y locations of the point, and the values of
    each of the bytes in the layers at a particular location with the dependent
    variable, or layer00, at column 3.
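
    To make the layout concrete, here is an illustrative C fragment (not part
    of the GMS) that picks apart one such line: the first two numbers are the
    x and y location, the third is the dependent variable (layer00), and the
    remaining numbers are the byte values of the predictor layers:

      #include <stdio.h>

      int main(void)
      {
          /* one line of the "train" file, as printed above */
          const char *line = "0 0   85 254   1 102 169  85   1 254";
          int x, y, dep, pred[16], n = 0, used;
          const char *p = line;

          sscanf(p, "%d %d %d%n", &x, &y, &dep, &used);   /* x, y, layer00 */
          p += used;
          while (n < 16 && sscanf(p, "%d%n", &pred[n], &used) == 1) {
              p += used;                                   /* remaining layers */
              n++;
          }
          printf("cell (%d,%d) dependent=%d with %d predictor values\n",
                 x, y, dep, n);
          return 0;
      }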

    The default behaviour of the presampling algorithm is as follows:

    • a random sample with replacement, of
    • a fixed number of points from outside the masked area,
    • with equal proportions of the values of the dependent variable
    • plus a "background" value.

    The GMS allows the user to choose whether or not to sample with replacement,
    that is, the user must decide if the "test" and "train"
    files are limited to the number of observations that were actually made
    or if these files should be made larger by allowing single data points
    to be represented more than once. This option is controlled by the Resamples
    and the Replace flags in the parameters file (or by the -resamples and
    -noreplace command line options). There are advantages and disadvantages
    to both approaches and these arguments are outlined below.

    Using the command line option -prop or the parameters flag Propflag
    <0,1>

    There are two primary reasons why we would choose to control the proportions
    of the dependent values. First, a data set of sightings of a species will
    be composed of values in varying proportions. Varying proportions make
    it difficult to compare the predictive accuracy of models on different
    species. For instance, in rare species, if the proportion of sightings
    is very low, such as less than 5% of the total, the strategy of predicting
    the absence of the species everywhere will have an expected accuracy of
    95%. Presampling the data to an even distribution i.e. 50% presences and
    50% absences, allows consistent comparison of accuracy between species.

    Secondly, we will see later that the GMS uses a measure of "significance"
    in determining how good a rule is and if it should be maintained in the
    final model. The measure of significance used in the GMS calls on the normal
    approximation to the binomial distribution. This estimate becomes more
    inaccurate at either end of the abundance scale. That is, the estimate
    becomes more inaccurate when the number of occurrences of a species is either
    very rare in the dataset (probabilities near zero) or if it is extremely
    common (probabilities near one).

    Using the command line option -noreplace or the parameters flag Replaceflag

    Small numbers of records occur with rare species or those of a very
    restricted range. When this occurs sampling with even proportions cannot
    occur without replacement of the record. While presample supports the option
    of sampling with or without replacement, sampling with replacement provides
    generality by allowing model development using the range of possible frequencies
    of records, including from a single datum.

    The disadvantage of sampling with replacement is that you are in essence
    saying that the species and its associated attributes occur more often
    than they actually do in the landscape. Ultimately, we would like a modelling
    system that performs well on rare species, on common species, and on those
    in between these two extremes. The generality of being able to compare models
    as if all plants occurred with the same frequency in the landscape is achieved
    at the expense of the assumption of independence of the probability of
    selecting data points.

    Using the command line option -resamples or the parameters flag Resamples

    Allowing arbitrarily large numbers of data points can lead to long
    computation times for little gain in predictive accuracy. The default for
    resampling with replacement is 2500 data points. This typically provides
    sufficient information for the system to have data points from throughout
    the range of possible configurations. If you wish to change this limit
    you may specify a new limit in either the parameters file or on the command
    line.

    The Resamples X line in the parameters file (or -resamples X on the
    command line) will write X points to each of the "test"
    and "train" files. If the resample option is used by itself
    then the "test" and "train" files will have equal occurrences
    of presence and absence even if this is not the case in the original data.
    Specifying a Resamples X value in the parameters file (eg. Resamples 500)
    will write X data points each to the "test" and "train"
    files even if you have, say, only 100 points in your data file.

    In summary, the command line option -noreplace or the parameters file
    line Replaceflag 0 will set the sampling to non-replacement. When you turn
    the Replaceflag off (ie. set it to 0) then you should still specify a Resamples
    value in the parameters file to control how many of your points will be
    written to the "test" and "train" data sets. In this
    case the Resamples value will be less than the number of points in your
    data.

    The Modelling Tools

    To model, you create a preliminary set of rules, refine them using
    a genetic algorithm, and then apply the refined model to your test data set
    to assess its performance.

    initial

    The initial program is run using

    > initial
    

    The training set generated by presample is input to the next program
    initial. This produces an initial model – a good initial starting point
    for the next stage of developing a model. The initial model, output in
    the file "prelim", is a set of rules.

    Each rule, which is a model in itself, is an if-then statement used
    for making inferences about the values of the variable of interest. The
    sets of rules developed by the GMS are more accurately described as inferential
    models rather than mathematical models. Inferential modelling differs from
    mathematical modelling in that the models are more closely related to logic
    than mathematics and the basic process is logical inference rather than
    calculation.

    The general form of a rule is as follows:

    Given that if A then B, and A is true, then predict B.

    The statement denoted as A is called the precondition while the one
    denoted as B is called the conclusion. The accuracy of a rule is determined
    from simple probability calculations. A set of data can be identified with
    the precondition of a rule (e.g. the set of data with rainfall between 600mm
    and 700mm). The probability of occurrence of the species can be calculated
    from the number of these cells in which the species occurs, divided by
    the number of cells selected by the precondition. For those who are unfamiliar
    with probability calculations a brief introduction is provided in the following
    section. If you are familiar with this topic you may wish to skip
    to the next section.
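
    Before moving on, here is a minimal illustration (toy data and counts
    invented here, not GMS code) of this calculation: the accuracy of a rule
    is the number of cells where both the precondition and the species presence
    hold, divided by the number of cells the precondition selects.

      #include <stdio.h>

      int main(void)
      {
          /* toy training data: rainfall (mm) and species presence for 8 cells */
          double rain[]    = { 550, 620, 660, 690, 710, 640, 600, 680 };
          int    present[] = {   0,   1,   1,   1,   0,   1,   0,   0 };
          int i, selected = 0, hits = 0;

          /* precondition of the rule: rainfall between 600mm and 700mm */
          for (i = 0; i < 8; i++) {
              if (rain[i] >= 600 && rain[i] <= 700) {
                  selected++;
                  if (present[i]) hits++;
              }
          }
          /* P(present | precondition) = #(present and precondition)/#(precondition) */
          printf("rule accuracy = %d/%d = %.2f\n", hits, selected,
                 (double)hits / selected);
          return 0;
      }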

    Overview of logic and
    probability

    Logical statements are elementary propositions which can be either true
    or false. These propositions are designated by symbols, usually capital
    letters. Compound propositions are formed from elementary propositions
    through modification by negation or through connectives as listed below.

    Logical deduction also requires a rule of inference from which the true
    value of unknown propositions can be determined. The main such rule is
    called modus ponens, also shown below.

    
    Name                Symbol   Example         Natural language
    propositions        P, Q     P               x>42, x is between 0 and 5, y=3
    not                 !        !P              x is not between 0 and 5
    and                 ^        P^Q             x=4 and y=5
    or                  v        PvQ             x=3 or y=5
    implies             =>       P=>Q            if P then Q
    iff                 <=>      P<=>Q           P if and only if Q
    rule of inference   |=       P, P=>Q |= Q    if P then Q and P is true, conclude Q is true

    Basic elements of Probability

    The basic elements of probability are sample points and events.

    Sample points are observations or measurements of the world (e.g. when
    a species occurs in a cell) and are generally denoted with the capital
    letters E1, E2, E3, etc. These sample points can be counted.

    A specific collection of sample points is referred to as an event and
    is generally denoted with a single capital letter A, B, C, etc.

    The sample points that compose events can be counted by summing the
    observations. In this manual this is denoted: #(A) (e.g. number of cells
    in which species A occurs). For example, let an "x" in the next
    figure represent a recorded presence of species A in a raster dataset.
    This is equivalent to saying each "x" is a sample point in the
    event "species A is present."

    The probability of the outcome of any event is denoted P(A) and is equal
    to the sum of the probabilities of the sample points in A. A particular
    event is said to occur if any sample point in the event occurs. That is
    if A is observed #A times (a sample point within A occurs) then the probability
    of A, P(A) is:

    P(A)=#A/n

    where n is the total number of data points.

    In our example then:

    P(A) = #A/n = 6/36 = 0.17

    In general, the probability that an event A will occur is between 0
    (there is no chance of the event occurring) and 1 (it is certain that the
    event will occur).

    If two events are affiliated in such a way that the occurrence of one
    indicates something about the occurrence of the other then the two events
    are said to be related. The magnitude of the relationship is given by a
    conditional probability. A conditional probability is the probability that
    event A will occur given that event B has occurred and is written P(A|B).

    P(A|B) can be calculated by:

    P(A|B) = P(AB)/P(B)

    P(B) is defined above and P(AB) is the intersection of events A and
    B. That is, P(AB) is the event that both A and B occur.

    Any sample point that occurs simultaneously in both A and B indicates
    that the event AB has occured. P(AB) can be calculated by using the Multiplicative
    Law of Probability which states that given two events A and B, the probabilty
    of the intersection AB is:

    P(AB) = P(A)P(B|A)
          = P(B)P(A|B).
    

    In the notation of this manual then:

    P(A|B) = P(AB)/P(B) = #AB/#B

    Where #AB is the number of cells which occur in both A and B.

    Continuing with our example, let's say that we have, in addition to the
    survey data represented in figure 1 above, an image with a value of an
    attribute estimated for each raster in the dataset.

    You could imagine that this (simplistic!) raster image represents two
    geology types found in your study area. Now in order to make a prediction
    you want to look at the relationship between the geology types and the
    occurrence of species A. We defined the probability of A given B above:

    P(A|B) = P(AB)/P(B) = (# of cells in which both A and B occur)/(# of cells
    where B occurs) = #AB/#B

    So in the case that B=1

    P(A|B) = 5/20 = 0.25

    While in the case that B=2

    P(A|B) = 1/16 = 0.06

    The explanation of the above problems is presented in the language
    of probability theory. For those of you acquainted with statistics this
    particular problem may sound like the familiar binomial theorem. Before
    continuing with this discussion we should first define a few terms.

    A discrete random variable is one that can assume a countable
    number of values. We may denote a discrete random variable as y and in
    the case of our example let y equal the number of occurrences of species
    A in our dataset. The probability of the occurrence of A is denoted p(y)
    and is equivalent to P(A) when stated according to probability theory.

    A particular type of discrete random variable is the binomially distributed
    variable. This variable can take the value of 0 (in our example this may
    indicate the absence of species A) or 1 (present). This is written in
    shorthand notation as:

    y ~ B(n, p)

    Where n is the number of trials (the number of points in your dataset)
    and p is the probability that the binomial variable will occur.

    In the most general terms then, the example given in figures 1 and 2
    above can also be described in statistical terms as a binomial experiment.
    A binomial experiment:

    1. Consists of n identical trials.

    2. The outcome of each trial falls into one of k classes. For a binomial
    experiment k=2.

    3. The probability that the outcome of a single trial will fall in a
    particular class is denoted pi where i=1,2. This means that both classes
    have an associated probability of occurrence. Note that the sum of all
    pi's is 1, i.e. p1 + p2 = 1.

    4. We denote ni where i=1,2 (eg. n1, n2) as the number of trials in
    which the outcome falls into class i. Note that n1 + n2 = n which is the
    total number of data points in your dataset.

    5. The probabilities pi remain the same from trial to trial. And finally,

    6. The trials are independent.

    The general problem of predicting the presence or absence of a species,
    as we have described it above, satisfies all the definitions of a binomial
    experiment except the last one. Due to spatial autocorrelation we cannot
    truly say that the trials are independent. The notion of spatial autocorrelation
    can generally be thought of as the observation that things (natural phenomena)
    are more similar to those things which are spatially close to them and
    more dissimilar to those things farther away. For this reason the notion
    of an independent trial is violated. While this situation is not strictly
    optimal, the problem of spatial autocorrelation is an entire subfield of
    geographical studies and its solution is beyond the scope of this manual.

    Calculating p for our binomial distribution y~(n,p) when n is large
    is labor intensive. However, we can avoid these calculations because under
    certain circumstances the binomial distribution can be estimated using
    a normal or gaussian distribution. In other words, when n is large and
    p is not too close to zero or one, the binomial probability distribution
    has a shape that is closely approximated by a normal curve which has a
    mean = np and a standard deviation = sqrt(npq) where q = 1-p.

    The following normal approximation to the binomial distribution follows
    from Mendenhall et al. (1981, p. 550):

    With 1 degree of freedom, large n (>4), k=2, and pi not too close to zero
    or one, the following statistic for an outcome i is approximately standard
    normal: z = (ni - n*pi)/sqrt(n*pi*(1-pi))
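
    The short C sketch below (illustrative only) evaluates this statistic for
    a hypothetical case: a rule selects 100 cells, 70 of which contain the
    species, against an expected proportion of 0.5:

      #include <stdio.h>
      #include <math.h>

      /* z = (ni - n*pi) / sqrt(n*pi*(1 - pi)), the normal approximation to
         the binomial used to judge how far a count departs from expectation. */
      static double binomial_z(double ni, double n, double pi)
      {
          return (ni - n * pi) / sqrt(n * pi * (1.0 - pi));
      }

      int main(void)
      {
          printf("z = %.2f\n", binomial_z(70.0, 100.0, 0.5));   /* prints 4.00 */
          return 0;
      }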

    Later we will use this normal approximation to the binomial to evaluate
    the models we develop with the GMS.

    Event Space

    Below is a diagram representing a GIS scene combined with a landcover
    layer. Imagine you have to determine the relationships between two attributes,
    say perhaps species distribution and forest cover.

    Below is an idealized version of the example above.

    An event space is a graphical representation of sets of events. The
    event space has a parallel in the spatial representation of GIS. In a trial
    consisting of random sampling of points in the event space, the probability
    of a randomly chosen point satisfying a proposition A, say, is equal to
    the area of shape A.

    Inferences for Prediction

    To predict, according to the Concise Oxford Dictionary is to foretell
    or prophesy. To predict accurately is to correctly foretell an outcome.
    If we wish to foretell an event accurately, such as whether a cell is A
    or B, then it would be wise to do so when the probability of that event
    is one. One way to do this is to use the rule of deduction. We would expect
    to predict accurately by the rules of logical inference: Given that if
    A then B, and A is true, then predict B.

    In probabilistic terms we would expect that if the probability of B
    given A is 1 (P(B|A)=1) and the probability of A is 1 (P(A)=1) then the
    probability of B is 1 (P(B)=1). The problem of prediction then consists
    of finding situations in which the conditional probability P(B|A) is one,
    or at least very high, because from them we can construct the proposition
    A=>B and use it for inference.

    These propositions, which I shall call rules from now on, are frequently
    used in our natural language as tautologies – e.g. if you go in the water
    then you will get wet. Below are some examples of different types of rules.

    Inclusion – this species occurs in rainforest gullies (i.e. if species
    occurs then it is in a rainforest gully),

    Causation – friction causes heat

    Probabilistic causation – if you smoke then your likelihood of getting
    cancer is increased.

    You can see that there are a variety of types of rules that resemble
    the basic if-then pattern. The analysis of the exact relationship between
    them has occupied logicians for centuries and doesn’t concern us here.
    For illustrative purposes I would like to examine simple cases of rules
    that can be validly derived from the range of configurations of two circles
    in a square, and evaluate their value for prediction.

    The first situation where the two circles overlap illustrates the calculation
    of the conditional probability P(B|A) from the event space diagram.

    P(B|A) = P(BA)/P(A) = #(BA)/#(A) = 0.2 (approximately)

    This is much less than 1 and so the rule A=>B (read: "if A then
    B") would not be inferred.

    In the next situation, where the two circles do not overlap:

    P(B|A) = P(BA)/P(A) = #(BA)/#(A) = 0/#(A) = 0

    This is 0 and so the rule if A then not B (denoted A=>!B) would be
    inferred.

    The next two situations are a little more complicated.

    P(B|A) = P(BA)/P(A) = P(A)/P(A) = 1

    P(B|A) is 1 so the rule A=>B would be inferred. The inverse does
    not hold however.

    P(A|B) = P(BA)/P(B) = P(A)/P(B) = #(A)/#(B) = 0.5 (approx)

    This is much less than 1 and so the rule B=>A would not be inferred.
    Note, however, that this rule is adequate for predictive purposes. Consider
    the situation that occurs when the factor B that predicts A envelops A.
    In the spatial analogue it means that the environmental attribute (B) used
    to predict A with high probability does not apply in all cases where B
    is possible. So when B occurs we do not necessarily expect A to occur,
    even though it is a good predictor.

    The situation where A=>B and B=>A are true only holds when the
    two shapes are coextensive.

    This is clearly a special condition that is very rarely met when performing
    overlays. This accounts for some of the power of using rules for prediction.
    The rules only need to apply part of the time, or locally, rather than
    all of the time, or globally. Generally more local rules with higher probability
    can be found than global rules.

    Applications to Modelling

    Now lets examine a few applications of logic as they pertain to modelling.
    The first is the BIOCLIM modelling technique used for modelling the distribution
    of species. The basic idea behind this technique is the uncontestable notion
    that living things have environmental tolerances beyond which they cannot
    survive. The model is developed by enclosing the occurrences of the entity
    of interest in a box or envelope defined by percentile ranges of the variables
    of importance, such as climatic variables.

    Using the x and y axes as climatic variables leads to a typical event
    space diagram below.

    The box defined by B is often used to predict the distribution of the
    species observed at A. When the diagram is compared with the previous diagrams
    it can be seen that the situation is very similar to the one where
    A was completely enclosed within B. From this we inferred that A=>B
    and not B=>A. That is, the environmental range B is validly predicted
    from the occurrences A, rather than the occurrences being predicted from
    the environmental range.

    Again, consider the inverse case from A=>B which is B=>A; If a
    point is outside the climatic range of the species then the species will
    not occur. But the inference that the species will always occur within
    the climatic range is logically invalid. BIOCLIM therefore predicts the
    absence of a species from the environment, but not the presence of a species.

    When does BIOCLIM predict the presence of a species? In practice the
    probability of the species occurring given points within the range approaches
    one when the occurrences of the species fill the entire climatic range.
    This can occur when the species has a very restricted distribution, such
    as a single occurrence or unique habitat. In this case the distribution
    and range are coextensive.

    The second application pertains to all models where the predictive accuracy
    is quoted to support the accuracy of a particular model. You see statements
    such as "this model x gave a high accuracy of 0.95 and this was better
    than model y which only gave 0.85." Below is a situation where the
    rule predicts with a high accuracy.

    Where B is the whole of the event space S the probability of B given
    A is very high, greater than 0.95, one in fact. If one relied on the conditional
    probability alone then one would say that A=>B is a good rule. However
    it can be seen that any area within the space would give a high probability.
    This is because the prior probability of B is already high.

    To use the predictive accuracy as some guide to how "good"
    a model is one needs an idea of how hard the task of prediction is. Predicting
    B above is not difficult, its probability is one, like death and taxes.
    Predicting a very rare event, one with low prior probability, is much harder.

    In the smoking causes cancer example, the rule is given experimental
    credibility because it has been shown that smoking raises the probability
    of cancer, i.e. the incidence of cancer in smokers is greater than the
    incidence of cancer in the general population. P(B|A) > P(B)

    This relationship can be decomposed as follows. P(B|A) > P(B)

    P(BA)/P(A) > P(B) using definition of conditional probability,

    P(BA) > P(B)P(A), since P(A) is always greater than zero.

    This is familiar from the identity P(BA) = P(B)P(A), which describes
    the independence of events A and B. Thus the relation above occurs when
    there is a positive dependence between events. In statistical terms this
    is equivalent to a positive correlation between variables. Positive correlation
    is often used as an estimate of goodness of regression models for example.

    Interestingly, satisfying the relationship above does not necessarily
    lead to accurate prediction. The diagram below illustrates a case where
    P(B|A) is much greater than P(B) but P(B|A) is low. The diagram below shows
    that a model with a high correlation does not neces