Niche Business Becoming a Commodity

Niche products, niche businesses, niche markets — small businesses are often described as surviving in niches, finding their niche, defining or redefining their niche. Most people think entrepreneurs start their own businesses, but most buy running ventures, like franchises, that already occupy surviving niches. Efforts to understand and predict business niches quantitatively are still at an early stage.

Web listings of businesses for sale have improved the tradability of all assets, including businesses themselves, to the point that they are becoming more like commodities. Here are some sources of numeric information that could be used.

Businesses for Sale “Connecting Buyers and Sellers of Businesses” is an international source of businesses for sale.

BizBuySell is “The Internet’s Largest Business For Sale Marketplace”. This site includes ‘fire sales’ of assets, and business liquidations. Includes cost, gross and net return data.

BizQuest, “A business for sale marketplace”, is a marketplace of 1000s of businesses for sale and includes resources for buying a business or selling your business online. Includes cost, gross and net return data. “Showing you what others overlook” posts off-the-wall articles on a variety of topics and tracks how well they do with AdSense in terms of per-click averages.

Business Nation “Small business website: information & ideas for starting a small business or home business startup, small business start-up website links, business start-up info…” covers all bases.

Of course Yahoo Directory lists a wide range of resources.

Any more good ones?

How are bimodal distributions created and modeled?

Demetris Koutsoyiannis responds:

I agree that a bimodal distribution is seldom seen. Well, my experience is not from ecological but mainly from hydrological processes but I suspect that the behaviours would be similar.

I have seen claims of bimodality several times but I was never convinced by them, as I did not read any argument supporting it other than empirical histograms. However, we must be aware that the uncertainty of the histogram peaks is large. A simple Monte Carlo experiment with, say, a normal distribution suffices to demonstrate that (unless the number of generated values is very high) it is very common to get a histogram with two, three or more peaks. This, however, is a purely random effect; obviously the normal density is unimodal.
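That Monte Carlo experiment can be sketched in a few lines of base R (the sample size and bin count below are arbitrary choices): count the histogram peaks of repeated small samples from a unimodal normal density.

```r
# Count interior local maxima (peaks) in histograms of small normal samples.
# A unimodal density often yields multi-peaked histograms purely by chance.
set.seed(1)
count_peaks <- function(x, breaks = 20) {
  h <- hist(x, breaks = breaks, plot = FALSE)$counts
  sum(diff(sign(diff(h))) == -2)  # sign change + to - marks a local maximum
}
peaks <- replicate(1000, count_peaks(rnorm(100)))
mean(peaks > 1)  # fraction of samples whose histogram shows two or more peaks
```

With only 100 values per sample, a large fraction of histograms show multiple peaks despite the underlying density being unimodal.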

So, I think that one must have theoretical reasons to accept a bimodality hypothesis. As a simple illustration, consider a system described by a random variable X, which switches between two well defined states, 1 and 2 with probabilities p and 1-p. Assume that the conditional density of X given the state is normal in each of states 1 and 2 and denote it f1(x) and f2(x), respectively. Then the unconditional density will be p f1(x) + (1-p) f2(x). It can be easily observed that if the means of the two densities are different, then certain combinations of the standard deviations and the probability p result in a bimodal unconditional density.
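The two-state mixture argument can be checked numerically in base R; the means, standard deviations and p below are illustrative values chosen to give well-separated states:

```r
# Unconditional density p*f1(x) + (1-p)*f2(x) for a two-state system.
# With well-separated means the mixture density is bimodal.
p <- 0.4
f <- function(x) p * dnorm(x, mean = 0, sd = 1) +
                 (1 - p) * dnorm(x, mean = 4, sd = 1)
x <- seq(-4, 8, by = 0.01)
d <- f(x)
n_modes <- sum(diff(sign(diff(d))) == -2)  # interior local maxima of the mixture
```

Here n_modes comes out as 2: each conditional density contributes its own mode. Shrinking the separation of the means, or pushing p toward 0 or 1, collapses the mixture back to a single mode.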

Continue reading How are bimodal distributions created and modeled?

What are the conditions for valid extrapolation of statistical predictions? Answer II.

Demetris Koutsoyiannis

Before I attempt to describe my answer, I would like to make some clarifications on the nature of a statistical prediction and mention some points that need caution.

1. A statistical prediction should be distinguished from a deterministic prediction. In a deterministic prediction some deterministic dynamics of the form y = f(x1, …, xk) are assumed, where y is the predicted value, the output of the deterministic model f( ), and x1, …, xk are inputs, i.e. explanatory variables. The model f( ) could be either a physically based one or a black box, data driven one. The latter case is very frequent, e.g. in local linear (chaotic) models and in connectionist (artificial neural network) models.

Now in a statistical prediction we assume some stochastic dynamics of the form Y = f(X1, …, Xk, V). There are two fundamental differences from the deterministic case. The first, apparent in the notation (the upper-case convention), is that the variables are no more algebraic variables but random variables. Random variables are not numbers, as are algebraic variables, but functions of the sample space. This is very important. The second difference is that an additional random variable V has been inserted in the dynamics. This sometimes is regarded as a prediction error that could be additive to a deterministic part, i.e. f(X1, …, Xk, V) = fd(X1, …, Xk) + V. However, I prefer to think of it as a random variable manifesting the intrinsic randomness in nature.

Continue reading What are the conditions for valid extrapolation of statistical predictions? Answer II.

What are the conditions for valid extrapolation of statistical predictions? Answer I.

Thanks to Martin Ringo for this answer.

Let me offer a little of the terminology of forecasting, which, I hope, will make the question clearer. When you are forecasting from some kind of structural model, say Y = f(X1, …, Xk), there is a difference in whether you have to forecast the Xs as well as the Y. If you don’t, it is an unconditional forecast; if you do, it is conditional. For an unconditional forecast, the inference is a pretty straightforward exercise of the classical linear model, at least if your structural relationship is estimated that way, with a nonlinear version for nonlinear estimations. For a conditional forecast, life can be messy, since you have to take into account the distribution of the exogenous variables, the Xs, as well as the error term.

I recently reviewed a forecast where the analyst treated a conditional forecast like an unconditional one, and consequently underestimated the forecast error by over a factor of two.
Continue reading What are the conditions for valid extrapolation of statistical predictions? Answer I.

How to regress a stationary variable on a non stationary variable? Answer II

Demetris Koutsoyiannis responds:

I think that such questions should not be treated in an algorithmic manner and that it is important to formulate them in the clearest and most consistent manner.

So, let us assume that we have a nonstationary stochastic process X(t) and a stationary process Y(t); I have interpreted “variable” here as process, because the notion of stationarity/nonstationarity relates to a (stochastic) process, not a variable. Is the question how to establish a regression relationship between Y(t) and X(t)? For instance, a relationship of the form Y(t) = a(t) X(t) + V(t), where a(t) is a deterministic function of time and V(t) a process independent of, or uncorrelated with, X(t)? Without going into detailed analysis, it seems to me that in such a relationship it is difficult to have a constant a(t) = a, i.e. independent of time. Also, V(t) would have to be nonstationary too. So, while we can treat a time series (observations) of the stationary Y(t) as a statistical sample, we cannot do the same for X(t) or V(t). So, I doubt that there is a statistical procedure to infer a(t) and the statistical properties of V(t) (mean, variance, etc.), which are functions of time too. In addition, I do not find such a relationship useful at all.

Continue reading How to regress a stationary variable on a non stationary variable? Answer II

How to regress a stationary variable on a non stationary variable? Answer I

Martin Ringo responds:

This is the wrong question. The analyst shouldn’t be worried about whether the dependent or independent variable is stationary or non-stationary. The issue is the error term.

In the Box-Jenkins procedure(s) — or maybe I should call it paradigm — the non-stationary stuff is removed. To me that removal is what is interesting, and all the stuff that Messrs. Box and Jenkins do is the treatment of serial correlation. But be you structural econometrician or time-series statistician, you can merrily regress a stationary variable on a non-stationary variable. You merely have to recognize that there is no impunity in regression: you still have to check the residuals to see if they behave in a roughly white-noise manner.
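A base-R sketch of the point (the series here are simulated, not real data): the regression of a stationary variable on a random walk runs without complaint, and it is the residual check that matters.

```r
# Regressing a stationary series on a random walk: the regression runs,
# but the residuals must be checked for (near) white-noise behaviour.
set.seed(7)
n <- 500
x <- cumsum(rnorm(n))  # non-stationary regressor (random walk)
y <- rnorm(n)          # stationary dependent variable, unrelated to x
fit <- lm(y ~ x)
r <- resid(fit)
rho1 <- cor(r[-1], r[-n])  # lag-1 autocorrelation as a crude whiteness check
```

Here rho1 is near zero, so the residuals pass the crude check; with serially correlated residuals the usual standard errors would be misleading.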

Continue reading How to regress a stationary variable on a non stationary variable? Answer I

Do maximum entropy time series predictions tend towards the mean?

Thanks for the answer to this question by Demetris Koutsoyiannis

I think that statistical predictions tend always to the mean as time increases. If we use the maximum entropy principle to obtain these predictions, the result depends on the time scale of entropy maximization. For instance, if the entropy maximization is done on the observation time scale, then the prediction may be equivalent to a prediction obtained by a Markov model. However, other settings of entropy maximization (on several time scales) result in long range dependence (as I have demonstrated in my 2005 paper “Uncertainty, entropy, scaling and hydrological stochastics, 2, Time dependence …” in Hydrological Sciences Journal). In the latter case statistical predictions may tend to the mean much slower than in the Markovian case and their confidence intervals would be much wider. Also, in Monte Carlo realizations, the excursions from the mean will be longer and wider.
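The geometric decay toward the mean in the Markovian case can be sketched in R; the AR(1) coefficient and starting value below are arbitrary illustrations, not from the paper cited above.

```r
# k-step-ahead forecast of an AR(1) (Markov) process: mu + b^k * (x0 - mu).
# The prediction decays geometrically toward the mean as lead time k grows.
ar1_forecast <- function(x0, mu, b, k) mu + b^k * (x0 - mu)
f <- sapply(1:50, function(k) ar1_forecast(x0 = 3, mu = 0, b = 0.8, k = k))
```

A long-range dependent process would instead approach the mean much more slowly than this geometric decay, with correspondingly wider confidence intervals.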

Continue reading Do maximum entropy time series predictions tend towards the mean?

Comparison of Climate Variables in Species Models

One of the main inputs into a niche model is the environmental variables. Optimizing the choice of variables is important for many reasons, primarily interpretation and subsequent accuracy on independent test data.

In almost all cases to date, annual climate averages have been used in modeling species distributions. Where models have been developed and annual averages of climate compared with monthly variables and others such as vegetation, improvements in the accuracy were attributed to the monthly climate data sets (i.e. greater temporal resolution).

Continue reading Comparison of Climate Variables in Species Models

Three Variable Bayes Net for Species Prediction

The post Bayesian Networks introduced this useful and flexible form of modeling. Here is an example of a Bayesian Belief Net or BBN model of a simple three variable species prediction system.

In Fig 1 the top node is habitat quality for the species. Two lower nodes, the average temperature (Av_Temp) and the vegetation type (Veg_Type) at the site, represent simple environmental variables related to the presence of the species. For Av_Temp, the probability of the species being present follows a bell-shaped curve with a peak at 15 C. Veg_Type takes values 0 to 4, representing vegetation types, with the species potentially present in types 0 and 1.

The Classifier node represents the ecological niche model or ENM, and predicts the probability of presence or absence of the species based on the environmental nodes, represented as a two-dimensional matrix of probabilities, one for each combination of the environmental variables. The Result node is based on the predictions and the actual values: correct; omission, where the species is predicted to be absent but is present; and commission, where the species is predicted to be present but is absent. The probabilities in the tables have been entered by hand, but actual data could be used to calculate them.


Figure 1. A simple BBN used for prediction. When the values of climate and vegetation type are given, the values are propagated forward to infer the presence or absence of the species (Classifier), and the favourability of habitat (Habitat).

BBNs can perform the task of an ENM, predicting the probability of presence or absence of a species from environmental information. The values of the variables Av_Temp and Veg_Type are defined (probability 100). Given these environmental values, the Classifier predicts presence. The Result variable shows the expected accuracy of the classifier under these conditions (84.9%), while 15% of the time a commission error occurs (the Classifier predicts presence, but the habitat is unfavorable).


Figure 2. The Simple BBN used for investigating the source of errors. The Result variable is set to Omission (100) and changes in probabilities are propagated backwards throughout the net, indicating the values of Av_Temp where omission errors are greatest.

To see the source of errors, the Result variable is set to Omission (100). The network updates the values of all variables automatically (Figure 2). The Classifier predicts absent while the Habitat is favourable, the definition of an omission error. The expected value for vegetation is 0 or 1. The expected value for Av_Temp is either 5 to 10 or 20 to 25. Thus the system shows that omission errors occur at the extremes of the temperature range of the species.
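The backward propagation step can be illustrated with a stripped-down, two-variable version of the net in base R; the temperature bins and probabilities below are hypothetical, not the values behind Figure 2.

```r
# Toy backward inference: P(temperature bin | omission error) by Bayes' rule.
# Priors over bins, and omission rates per bin, are hypothetical values.
prior  <- c("5-10" = 0.25, "10-15" = 0.25, "15-20" = 0.25, "20-25" = 0.25)
p_omit <- c("5-10" = 0.20, "10-15" = 0.02, "15-20" = 0.02, "20-25" = 0.20)
posterior <- prior * p_omit / sum(prior * p_omit)  # Bayes' rule, normalized
round(posterior, 3)
```

The posterior mass piles up on the extreme bins, the same qualitative conclusion as the figure: omission errors concentrate at the edges of the species' temperature range.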

R Sweave Example

The previous post “Writing a Book Using R” described using LaTeX for writing a book, saving time with one master bibliography and other organizational devices. Sweave allows R code to be included in a LaTeX file. This is a good marriage: LaTeX provides typeset text, while R is statistics- and graphics-oriented.

Here is an example of the overall steps one would take to compile a combined latex and R code document for publication.

Part of the LaTeX is generated by the Sweave function in R. For example, you first run Sweave("script.R") in the R console. This passes through the LaTeX you write untouched, and processes the R commands, particularly plot() calls that produce figures. There is also a package called xtable that produces LaTeX from data.frames, model results and time series objects. To use it, you run your analysis, store the results in the appropriate object, such as a model, data.frame or time series, then call xtable(x), and all the results come out as formatted LaTeX.

With the LaTeX produced, you then compile it with any of the compilers; a typical sequence that automatically handles the table of contents, bibliography, index, and figure and table references would be:

  1. Sweave("master.R") — done in R
  2. pdflatex master — on the command line
  3. bibtex master — does the bibliography
  4. makeindex master — makes the index
  5. pdflatex master — put bibliography and index in
  6. pdflatex master — does the cross referencing

This should produce the finished pdf ready to print with the results of analysis, figures, and everything collated.
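For concreteness, a minimal Sweave source file might look like the sketch below (conventionally named with an .Rnw extension; the chunk contents are illustrative, not from the book). LaTeX and R live side by side: code chunks are delimited by <<>>= and @, and \Sexpr{} drops inline results into the text.

```latex
\documentclass{article}
\begin{document}
<<fig=TRUE>>=
x <- rnorm(100)
plot(x, type = "l")
@
The mean of the simulated series is \Sexpr{round(mean(x), 3)}.
\end{document}
```

Running Sweave() on this file executes the chunk, writes the figure, substitutes the \Sexpr{} value, and emits pure LaTeX ready for pdflatex.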

Hurst Coefficient Software

Long-range dependence is being identified in many disciplines, such as networking, databases, economics, climate and biodiversity. LTP is competing with the sexy “long tail” for top spot as a theory of cultural consumption. Thus the need for software offering complete long-range dependence analysis is crucial.

While there are some steps towards this direction, none are yet completely satisfactory. For one, the Hurst exponent cannot be calculated in a definitive way, it can only be estimated. Second, there are several different methods to estimate the Hurst exponent, but they often produce conflicting estimates, and it is not clear which of the estimators are most accurate.

A first step towards a systematic approach to estimating self-similarity and long-range dependence is SELFIS, a Java-based tool that automates self-similarity analysis.

In R there is the fracdiff package on CRAN for fitting ARFIMA(p,d,q) models with fractional d, and there is a one-to-one relationship between d and the Hurst parameter for these models: d = H - 1/2. These estimators are better than the R/S method, which is known to be far from optimal.

A useful summary of the issues and reference to other resources is
Estimating the Hurst Exponent

Demonstrating just how pervasive these concepts are in our daily life is the Physics of Fashion Fluctuations by R. Donangelo, A. Hansen, K. Sneppen and S. R. Souza. Here a simple model for the emergence of fashions, goods that become popular not due to any intrinsic value but simply because “everybody wants it”, in markets where people trade goods, shows the spontaneous emergence of random products as money. The model supports collectively driven fluctuations characterized by a Hurst exponent of about 0.7.

Scale Invariance for Dummies is an investigation of scale invariance or long term persistence (LTP) in time series including tree-ring proxies – the recognition, quantification and implications for analysis – drawn largely from Koutsoyiannis.

Anybody know of any commercial packages dealing with the Hurst coefficient?

Rolin Jones Discovers Clark Glymour

A number of posts here and here have compared the “hockey stick” construction of past temperatures to the play by Rolin Jones to illustrate an area of science where dramatization and self-promotion have become confused with the search for scientific truth. The background of this story is fascinating.

A story on the playwright relates the background to the play.

In doing research for the play, Jones was lucky to have as a resource the library at Yale University, where he was studying. He dusted off an eclectic tome called Android Epistemology, which had never been taken out by a student. “The pages were still stuck together,” he recalls. “I was amazed that somebody had written 250 pages on the theoretical epistemology of thinking, breathing robots.”

The book Thinking about Android Epistemology was edited by Kenneth M. Ford, Clark Glymour and Patrick J. Hayes. Clark Glymour was one of the most influential writers for me on modeling. He has written extensively on causation, especially the conditions for inferring causation from observational data. He also has software called Tetrad implementing his ideas. While his work is little known, I think the issues he explores are central to scientific method.

He defines ‘knowability’ in terms of the ability of the ‘knower’, establishing a hierarchy of minds capable of proving hypotheses of increasing complexity, similar to the pantheon of Greek Gods. For example, a simple existential hypothesis like “This is a ball” is provable by a mere mortal by producing evidence of the ball. A simple universal hypothesis like “All of these are balls” is disprovable by a mortal by showing an example that is not a ball. It is also provable in the case of a finite number of examples, by enumerating each of the balls. However, a mere mortal cannot enumerate an infinite number of balls in finite time, and so cannot prove the universal hypothesis. But a Greek God may have the power to do so by enumerating the nth ball in time 1/2^n, thus in the limit producing infinitely many balls in a finite time. Thus provability is a function of the power of the knower.

There is much, much more fascinating stuff in Glymour’s work. A list of his books is here. I hope the pages are not all stuck together as Rolin Jones found. Perhaps when robots start buying books, “Thinking about Android Epistemology” will become a best seller.

R code in Econometrics

One of the best, and possibly the only, guides to advanced use of R is the manual “Econometrics in R” by Grant V. Farnsworth. Dated June 26, 2006, it was originally written as part of a teaching assistantship and as a personal reference. Some of the topics covered I have found nowhere else. The manual is particularly thorough in its treatment of regression, i.e.:

3 Cross Sectional Regression
3.1 Ordinary Least Squares
3.2 Extracting Statistics from the Regression
3.3 Heteroskedasticity and Friends
3.3.1 Breusch-Pagan Test for Heteroskedasticity
3.3.2 Heteroskedasticity (Autocorrelation) Robust Covariance Matrix
3.4 Linear Hypothesis Testing (Wald and F)
3.5 Weighted and Generalized Least Squares
3.6 Models With Factors/Groups
4 Special Regressions
4.1 Fixed/Random Effects Models
4.1.1 Fixed Effects
4.1.2 Random Effects
4.2 Qualitative Response
4.2.1 Logit/Probit
4.2.2 Multinomial Logit
4.2.3 Ordered Logit/Probit
4.3 Tobit and Censored Regression
4.4 Quantile Regression
4.5 Robust Regression - M Estimators
4.6 Nonlinear Least Squares
4.7 Two Stage Least Squares on a Single Structural Equation
4.8 Systems of Equations
4.8.1 Seemingly Unrelated Regression
4.8.2 Two Stage Least Squares on a System

Is Temperature a Random Walk?

We use the data from CRU, input into R using the code in the post R Code to Read CRU Data. The initial approach to testing whether global temperatures from CRU follow a random walk is to run a Dickey-Fuller test for a unit root. The augmented Dickey-Fuller test checks whether a series has a unit root; the default null hypothesis is that the series does have a unit root.

A unit root refers to the coefficient b in the sequence below being equal to one, which leads to series with a tendency to wander arbitrarily far from the starting point and to display ‘trendiness’, the appearance of trends with no specific driving variable.

X_t = b X_{t-1} + e_t

The adf.test() command needs the tseries library for this test.
R automates the process of finding, downloading and installing libraries. In Windows, you are presented with a list of packages after selecting a CRAN mirror site (if you are connected to the internet). Then, on selecting tseries, all the required packages are loaded. In this case:

package 'scatterplot3d' successfully unpacked and MD5 sums checked
package 'mlbench' successfully unpacked and MD5 sums checked
package 'randomForest' successfully unpacked and MD5 sums checked
package 'SparseM' successfully unpacked and MD5 sums checked
package 'xtable' successfully unpacked and MD5 sums checked
package 'fBasics' successfully unpacked and MD5 sums checked
package 'sandwich' successfully unpacked and MD5 sums checked
package 'acepack' successfully unpacked and MD5 sums checked
package 'lmtest' successfully unpacked and MD5 sums checked
package 'car' successfully unpacked and MD5 sums checked
package 'dynlm' successfully unpacked and MD5 sums checked
package 'e1071' successfully unpacked and MD5 sums checked
package 'leaps' successfully unpacked and MD5 sums checked
package 'oz' successfully unpacked and MD5 sums checked
package 'Hmisc' successfully unpacked and MD5 sums checked
package 'chron' successfully unpacked and MD5 sums checked
package 'fCalendar' successfully unpacked and MD5 sums checked
package 'strucchange' successfully unpacked and MD5 sums checked
package 'DAAG' successfully unpacked and MD5 sums checked
package 'quadprog' successfully unpacked and MD5 sums checked
package 'zoo' successfully unpacked and MD5 sums checked
package 'its' successfully unpacked and MD5 sums checked
package 'tseries' successfully unpacked and MD5 sums checked

The downloaded packages are in
        C:\Documents and Settings\David Stockwell\Local Settings\Temp\Rtmp11853\downloaded_packages
updating HTML package descriptions

The next steps are to load the library, get the CRU data into a variable and run the Dickey-Fuller Test for Unit Root.

> library(tseries)
> adf.test(d)

 Augmented Dickey-Fuller Test

data:  d
Dickey-Fuller = -6.3935, Lag order = 12, p-value = 0.01
alternative hypothesis: stationary

Warning message:
p-value smaller than printed p-value in: adf.test(d)

We can try this again with annual values.

adf.test(d)   # with d now holding the annual averages

        Augmented Dickey-Fuller Test

data:  d
Dickey-Fuller = -2.0938, Lag order = 5, p-value = 0.5372
alternative hypothesis: stationary

For comparison, we can apply the test to random numbers, which don’t have a unit root, and to the cumulative sum of random numbers, which does. i.e.


The results are similar to the results for monthly and annual temperature: monthly temperatures don’t have a unit root, and annual temperatures do. This preliminary testing suggests annual temperatures are indistinguishable from a random walk. This doesn’t mean that they are one, as more rigorous testing may succeed in rejecting the null hypothesis. The post Scale Invariance for Dummies suggests that temperatures are actually very close to, but slightly below, a unit root.

R Code for Brownian Motion

According to Wikipedia, the mathematical model for Brownian motion (also known as a random walk) can be used to describe many phenomena besides the random movements of minute particles, such as stock market fluctuations and the evolution of physical characteristics in the fossil record. The simple form of the mathematical model for Brownian motion is:

S_t = S_{t-1} + e_t

where e_t is drawn from a probability distribution. My initial implementation of Brownian motion in R in 2 dimensions begins:


brownian <- function(n = 1000, plot = TRUE, fun = rnorm) {

Continue reading R Code for Brownian Motion
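For readers who do not follow the link, here is a minimal completion consistent with the signature above; it assumes the 2-D walk is formed from cumulative sums of increments drawn from fun, which is my reading of the post, not the author's exact body.

```r
# A sketch completing the truncated brownian() above: a 2-D random walk
# built from cumulative sums of increments drawn from `fun` (rnorm by default).
brownian <- function(n = 1000, plot = TRUE, fun = rnorm) {
  x <- cumsum(fun(n))
  y <- cumsum(fun(n))
  if (plot) plot(x, y, type = "l")
  invisible(cbind(x, y))
}
set.seed(5)
w <- brownian(200, plot = FALSE)  # a 200-step walk, no plot
```

Passing a different fun, say function(n) rcauchy(n), swaps the increment distribution and produces a heavy-tailed walk with the same code.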

R Code to Read CRU Data

Reading CRU data is an opportunity to demonstrate some of the features available for programming in R. The Climate Research Unit (CRU) data is a record of the global, northern and southern hemisphere temperatures compiled from temperature sources around the globe for the last 150 years. The files look like this, with alternating lines of temperature values and coverage values for each month, and annual averages at the end:

 1856 -0.405 -0.486 -0.985 -0.277 -0.140  0.313 -0.009 -0.220 -0.391 -0.540 -0.985 -0.357 -0.374
 1856     13     18     16     16     15     16     14     14     17     14     16     16
 1857 -0.520 -0.035 -0.536 -0.848 -0.582 -0.186 -0.044  0.032 -0.402 -0.619 -0.791  0.341 -0.349

The obvious brain-dead approach is given on the web site: download a file such as crutem3gl.txt and write something like the FORTRAN pseudo-code below, with sequential reads of each line in a loop to the end of the file. Actual code in C would entail two more loops for reading the values to the ends of the lines, plus code for handling the incomplete line for the current year.

 for year = 1850 to endyear
  format(i5,13f7.3) year, 12 * monthly values, annual value
  format(i5,12i7) year, 12 * percentage coverage of hemisphere or globe

Here is my example of the same function for extracting and plotting the global temperature in R using R features. Some R sugar is: the data is read from a URL directly rather than a downloaded file. Secondly the function readCRU is defined with default values that can be overridden if required.

readCRU <- function(source = "", temps = 2:13, plot = TRUE) {
  f <- read.table(source, fill = TRUE)
  d <- as.vector(t(f[seq(1, length(f[, 1]), by = 2), temps]))
  if (plot) plot(d, type = "l")
}


Most importantly, using the matrix and vector capacity of R, the entire dataset is read into a data.frame with one command using the read.table command. In the next line R’s powerful indexing selects the desired columns (i.e. f[seq(1,length(f[,1]),by=2),temps]). The data.frame is then converted to a transposed matrix and unrolled into a vector for plotting with a single plot command.

R allows all loops to be eliminated. The power of features like reading from URLs, default parameters and the plot commands makes data analysis quick. Anyone want to post a shorter solution?

A Simply Told Ptolemaic

The Washington Post has finally commented on the Wegman Report, and Whitfield hearings I and II on the so-called “hockey stick” graph — a trend line that purports to show little temperature variation throughout the Medieval Warm Period and a sudden and dramatic increase in global temperatures in the 1990s and therefore looks like a hockey stick. Their position:

The graph is hardly central to the modern debate over climate change. Yet the subcommittee has investigated the scientists who dared produce it and hounded them for information.

This despite the graph being in the summary for policy makers in the IPCC 2001, and used by dozens of major government agencies throughout the world to motivate global warming programs. And their spin on scientists being asked to justify their results — anyone would think we were back in the Medieval Warm Period and the hockey stick was the equivalent of the Ptolemaic system with the Earth the center of the universe.

So what is it all about? A good starting point is the statement here: Some Thoughts on Disclosure and Due Diligence in Climate Science. Subsequent to these efforts, the Wegman report uncovered a number of fictions in an area of climate science, and offered a number of constructive solutions to the pervasive problems they discovered.

So you can decide if this is important or not, below are compiled some historic statements by climate scientist Michael Mann and others regarding their science, together with relevant comments on the Mann study from the Wegman Report, and others.

On the Medieval Warm Period

While warmth early in the millennium approaches mean 20th century levels, the late 20th century still appears anomalous: the 1990s are likely the warmest decade, and 1998 the warmest year, in at least a millennium.

Northern Hemisphere Temperatures During the Past Millennium: Inferences, Uncertainties, and Limitations, by Michael E. Mann, Raymond S. Bradley and Malcolm K. Hughes, AGU GRL.

Very little confidence can be placed in statements about average global surface temperatures prior to A.D. 900 because the proxy data for that time frame are sparse.

Continue reading A Simply Told Ptolemaic

Simple Linear Regression Models

A “simple” regression model is simple because it has a single independent variable instead of several. Because “simple” is in the name, many people mistakenly assume such models are simple to use, applying them to their data without first checking whether the assumptions are met. Here is a useful web page from Duke University, Not-so-simple Regression Models, describing a general approach to developing simple linear regression models.
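A base-R sketch of what "checking the assumptions" can look like before trusting a simple regression (the data below are simulated, so the assumptions hold by construction):

```r
# Fit a simple regression and run crude assumption checks on the residuals:
# mean near zero and little serial correlation.
set.seed(9)
x <- runif(200)
y <- 2 + 3 * x + rnorm(200, sd = 0.5)
fit <- lm(y ~ x)
r <- resid(fit)
rho1 <- cor(r[-1], r[-length(r)])  # lag-1 autocorrelation of residuals
```

In practice one would also plot r against the fitted values to look for curvature and non-constant variance; plot(fit) in R produces the standard diagnostic panels.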

Continue reading Simple Linear Regression Models

Intelligent Emotions

Emotional Intelligence or EI is a concept popularized by Daniel Goleman as a complement to competence measures like IQ in the emotional sphere. But EI has the problem that it is not quantitatively defined with a number and standards like IQ. So it has been criticized by people like Eysenck:

“exemplifies more clearly than most the fundamental absurdity of the tendency to class almost any type of behavior as an ‘intelligence’.”

Statistics are the quintessential antiemotional tollgate. The “Little Handbook of Statistical Practice” is one of the deepest and best guides to statistics I have seen. Here Gerard Dallal asks “Is Statistics Hard?”. Statistics is not so much hard as counterintuitive: backwards, convoluted and shades of grey.

Continue reading Intelligent Emotions

How to Predict Random Numbers

The previous post “Random Numbers Predict Future Temperatures” used random numbers for prediction of climate. Random numbers may also be predicted. This is a major difference between models and natural phenomena: random numbers generated by a computer can always be predicted exactly, given knowledge of the code, and so have a deterministic generating mechanism, or model.

The above series of numbers appears to be a temperature trend such as 20th century warming, composed of random fluctuations with a slight upward trend. One might find on fitting a linear regression the coefficients are significant, and then use the model to predict the trend in upward temperatures to continue.

It is actually possible to predict all subsequent numbers in the series exactly with the following deterministic code for generating pseudo-random numbers in R.

dkrand <- function(n) {
  k <- 16807; m <- 2147483647  # assumed Park-Miller constants; not given in the original snippet
  q <- numeric(n + 1); x <- numeric(n)
  q[1] <- 1426594706           # the seed
  for (i in 1:n) {
    q[i + 1] <- (k * q[i]) %% m
    x[i] <- q[i + 1] / m
  }
  x
}

This example shows why you should be very concerned if you find a paper in which the authors use a model-building technique such as regression and then use it for prediction, treating the resulting model and coefficients as though the model were true.

The confusion is very well described in the essay by Gerard Dallal, “The Most Important Lesson You’ll Ever Learn About Multiple Linear Regression Analysis”, and is well stated by Chris Chatfield (1995):

It is “well known” to be “logically unsound and practically misleading” to make inference as if a model is known to be true when it has, in fact, been selected from the same data to be used for estimation purposes.


Chris Chatfield in “Model Uncertainty, Data Mining and Statistical Inference”, Journal of the Royal Statistical Society, Series A, 158 (1995), 419-486 (p 421).

Niche Media

If a picture is worth a thousand words, a video is worth more. The use of compelling media has been undergoing something of a revolution recently, driven by new social sites like YouTube. I like Salsa music, and found this clip of the Colombian band Guayacan (I think I saw these guys in Mexico City in 1997 – amazing act). If that’s a little too strong for your taste, here is another Colombian Salsa band
Grupo Niche. The YouTube science category has some cool robot clips.


One of the easiest ways to animate your results is with images, including GIF and PNG, displayed sequentially and, if required, in a continuous loop. Tools like ImageMagick generate these from multiple still images. Here is one of my early efforts, using predictions of the distribution of Swainson’s Flycatcher for each month of the year to display dynamic models of bird migration.


Animation of migration modeling of the Swainson’s Flycatcher in South America. It shows ‘niche following’ behavior, presence in a constant temperature zone in South America in relation to the sightings of the species.

A more recent use of the same technology shows the potential course of invasion of the Zebra Mussel (Dreissena), animated using the niche model in a process described in my upcoming book ‘Niche Modeling’.


This is a trend that could be used a lot more to get your message across. The latest buzzword is ‘viral media’: a few-minute, elevator-style presentation that grabs millions of eyeballs across the globe. Examples include the promotion of books such as Chris Anderson’s in the Day of the Long Tail, and the Flash demo for Mike Levin’s breakthrough technology for bloggers, HitTail.

Here is my attempt seven years ago to explain the importance of museum collections to biodiversity modeling in The Living Collections. At that time a film production studio was needed to produce such videos.

Now, with technologies like Flash and iFilm, the bar has dropped considerably. For an example of how much can be done to make your presentation dynamic with even simple tools like Powerpoint or Keynote on the Mac, check out Dick Hardt’s Sxip Identity 2.0 demo.

Examples of ways niche media could be used to good effect:

  • In presentations, show the effect of a change in a parameter on the lines of a graph: instead of two graphs, fade out the first line and fade in the second, then repeat.
  • Create 5 minute promotions.
  • Organize and display results according to ‘work flow’.