This is a situation that causes researchers to react in different ways. Some grapple with it, others exploit it, many avoid it. If you have experience with controversial research, your comments will be summarized and posted so others can benefit.
Niche products, niche businesses, niche markets — small businesses are often qualified as surviving in niches, finding their niche, defining or redefining their niche. Most people think entrepreneurs start their own business, but most buy running ventures, like franchises, already in surviving niches. Steps to understanding and predicting quantitative business niches are in early stages.
I agree that a bimodal distribution is seldom seen. My experience comes mainly from hydrological rather than ecological processes, but I suspect that the behaviours would be similar.
I have seen claims of bimodality several times but I was never convinced about them as I did not read any argument supporting it except empirical histograms. However, we must be aware that the uncertainty of the histogram peaks is large. A simple Monte Carlo experiment with say a normal distribution suffices to demonstrate that (unless the number of generated values is very high) it is very common to have a histogram with two, three or more peaks. This however is totally a random effect; obviously the normal density is unimodal.
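A minimal sketch of such a Monte Carlo experiment, in Python with only the standard library. The bin range, the number of bins, the sample sizes and the strict local-maximum definition of a "peak" are all assumptions made for illustration; none of them come from the original comment.

```python
import random

def histogram(values, lo, hi, n_bins):
    """Count values falling into n_bins equal-width bins on [lo, hi)."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            idx = min(int((v - lo) / width), n_bins - 1)  # guard against rounding
            counts[idx] += 1
    return counts

def count_peaks(counts):
    """Count strict local maxima (ties are not counted as peaks)."""
    peaks = 0
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < len(counts) - 1 else -1
        if c > left and c > right:
            peaks += 1
    return peaks

def fraction_multimodal(n_samples, n_bins=15, n_trials=500, seed=1):
    """Fraction of standard-normal samples whose histogram shows 2+ peaks."""
    rng = random.Random(seed)
    multi = 0
    for _ in range(n_trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        if count_peaks(histogram(sample, -3.0, 3.0, n_bins)) >= 2:
            multi += 1
    return multi / n_trials

if __name__ == "__main__":
    for n in (50, 500, 5000):
        print(n, fraction_multimodal(n, n_trials=200))
```

Comparing the printed fractions for small and large samples shows how often a unimodal density produces a multi-peaked histogram purely by chance.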
So, I think that one must have theoretical reasons to accept a bimodality hypothesis. As a simple illustration, consider a system described by a random variable X, which switches between two well defined states, 1 and 2 with probabilities p and 1-p. Assume that the conditional density of X given the state is normal in each of states 1 and 2 and denote it f1(x) and f2(x), respectively. Then the unconditional density will be p f1(x) + (1-p) f2(x). It can be easily observed that if the means of the two densities are different, then certain combinations of the standard deviations and the probability p result in a bimodal unconditional density.
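The two-state illustration can be checked numerically. The sketch below evaluates p f1(x) + (1-p) f2(x) on a grid and counts its local maxima; the grid size and the example parameter values are assumptions chosen only to show one bimodal and one unimodal case.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, p, m1, s1, m2, s2):
    """Unconditional density p*f1(x) + (1-p)*f2(x)."""
    return p * normal_pdf(x, m1, s1) + (1.0 - p) * normal_pdf(x, m2, s2)

def count_modes(p, m1, s1, m2, s2, n_grid=2001):
    """Count strict local maxima of the mixture density on a grid."""
    lo = min(m1 - 4 * s1, m2 - 4 * s2)
    hi = max(m1 + 4 * s1, m2 + 4 * s2)
    xs = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    ys = [mixture_pdf(x, p, m1, s1, m2, s2) for x in xs]
    return sum(
        1 for i in range(1, n_grid - 1)
        if ys[i] > ys[i - 1] and ys[i] > ys[i + 1]
    )

if __name__ == "__main__":
    # Well-separated means: two modes.
    print(count_modes(0.5, 0.0, 1.0, 4.0, 1.0))
    # Different means, but too close relative to the spread: one mode.
    print(count_modes(0.5, 0.0, 1.0, 1.0, 1.0))
```

Different means alone are therefore not enough; the separation has to be large relative to the standard deviations for the second mode to appear.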
Before I attempt to describe my answer, I would like to make some clarifications on the nature of a statistical prediction and mention some points that need caution.
1. A statistical prediction should be distinguished from a deterministic prediction. In a deterministic prediction some deterministic dynamics of the form y = f(x1, …, xk) are assumed, where y is the predicted value, the output of the deterministic model f( ), and x1, …, xk are inputs, i.e. explanatory variables. The model f( ) could be either a physically based one or a black box, data driven one. The latter case is very frequent, e.g. in local linear (chaotic) models and in connectionist (artificial neural network) models.
Now in a statistical prediction we assume some stochastic dynamics of the form Y = f(X1, …, Xk, V). There are two fundamental differences from the deterministic case. The first, apparent in the notation (the upper-case convention), is that the variables are no longer algebraic variables but random variables. Random variables are not numbers, as algebraic variables are, but functions on the sample space. This is very important. The second difference is that an additional random variable V has been inserted in the dynamics. This is sometimes regarded as a prediction error that could be additive to a deterministic part, i.e. f(X1, …, Xk, V) = fd(X1, …, Xk) + V. However, I prefer to think of it as a random variable manifesting the intrinsic randomness in nature.
Let me offer a little of the terminology of forecasting, which, I hope, will make the question clearer. When you are forecasting from some kind of structural model, say Y = f(X1, …, Xk), there is a difference in whether you have to forecast the Xs as well as the Y. If you don’t, it is an unconditional forecast; if you do, it is conditional. For an unconditional forecast, the inference is a pretty straightforward exercise in the classical linear model, at least if your structural relationship is so estimated, with a nonlinear version for nonlinear estimations. For a conditional forecast, life can be messy since you have to take into account the distribution of the exogenous variables, the Xs, as well as the error term.
I recently reviewed a forecast where the analyst treated a conditional forecast like an unconditional one, and consequently underestimated the forecast error by over a factor of two.
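A small simulation can show how this kind of underestimate arises. The sketch below assumes a linear model Y = a + bX + V with known coefficients (so parameter uncertainty is ignored) and Gaussian errors; the coefficient values and error spreads are hypothetical and have nothing to do with the forecast that was reviewed.

```python
import random
import statistics

def forecast_error_sd(a=1.0, b=2.0, sd_v=1.0, sd_x=1.0,
                      n_trials=20000, seed=7):
    """Return (sd of error when X is known, sd of error when X is forecast)."""
    rng = random.Random(seed)
    x_forecast = 5.0                      # hypothetical forecast of X
    err_x_known, err_x_forecast = [], []
    for _ in range(n_trials):
        x_true = x_forecast + rng.gauss(0.0, sd_x)   # X turns out different
        y_true = a + b * x_true + rng.gauss(0.0, sd_v)
        # X need not be forecast: only the error term V is left.
        err_x_known.append(y_true - (a + b * x_true))
        # X must be forecast: its error propagates through b.
        err_x_forecast.append(y_true - (a + b * x_forecast))
    return statistics.pstdev(err_x_known), statistics.pstdev(err_x_forecast)

if __name__ == "__main__":
    sd_known, sd_forecast = forecast_error_sd()
    print(sd_known, sd_forecast, sd_forecast / sd_known)
```

Under these assumptions the error variance when X must be forecast is sd_v² + b² sd_x², so ignoring the uncertainty in X understates the error by a factor of sqrt(1 + b² sd_x² / sd_v²).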
I think that such questions should not be treated in an algorithmic manner and that it is important to formulate them in the clearest and most consistent manner.
So, let us assume that we have a nonstationary stochastic process X(t) and a stationary process Y(t); I have interpreted “variable” here as process because the notion of stationarity/nonstationarity is related to a (stochastic) process, not a variable. Is the question how to establish a regression relationship between Y(t) and X(t)? For instance, a relationship of the form Y(t) = a(t) X(t) + V(t), where a(t) is a deterministic function of time and V(t) a process independent of, or uncorrelated with, X(t)? Without going into detailed analysis, it seems to me that in such a relationship it is difficult to have a constant a(t) = a, i.e. independent of time. Also, V(t) should be nonstationary too. So, while we can consider a time series (observations) of the stationary Y(t) as a statistical sample, we cannot do the same for X(t) or V(t). So, I doubt whether there is a statistical procedure to infer a(t) and the statistical properties of V(t) (mean, variance, etc.), which are functions of time too. In addition, I do not find such a relationship useful at all.
This is the wrong question. The analyst shouldn’t be worried about whether the dependent or independent is stationary or non-stationary. The issue is the error term.
In the Box-Jenkins procedure(s), or maybe I should call it a paradigm, the non-stationary stuff is removed. To me that removal is what is interesting, and all the stuff that Messrs. Box and Jenkins do is the treatment of serial correlation. But be you structural econometrician or time-series statistician, you can merrily regress a stationary variable on a non-stationary variable. You merely have to recognize that there is no impunity in regression. So you still have to check the residuals to see if they behave in a roughly white noise manner.
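As a rough illustration of that residual check, the sketch below regresses a stationary series on a random walk and computes a Ljung-Box statistic on the residuals. The simulated series, the choice of five lags and the use of the chi-square 95% point with five degrees of freedom (about 11.07) are assumptions for the example, not a recipe from the comment above.

```python
import random

CHI2_95_DF5 = 11.07  # approximate 95% point of chi-square with 5 d.o.f.

def ols_residuals(x, y):
    """Residuals from a least-squares fit of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [yi - intercept - slope * xi for xi, yi in zip(x, y)]

def ljung_box_q(resid, max_lag=5):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(resid)
    mean = sum(resid) / n
    d = [r - mean for r in resid]
    c0 = sum(v * v for v in d)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(d[t] * d[t - k] for t in range(k, n)) / c0
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total

def simulate(phi, n=500, seed=3):
    """X is a random walk; Y is a stationary AR(1) (white noise if phi=0)."""
    rng = random.Random(seed)
    x, y = [0.0], [0.0]
    for _ in range(n - 1):
        x.append(x[-1] + rng.gauss(0.0, 1.0))
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return x, y

if __name__ == "__main__":
    for phi in (0.0, 0.8):
        x, y = simulate(phi)
        q = ljung_box_q(ols_residuals(x, y))
        print(phi, q, "serially correlated" if q > CHI2_95_DF5 else "roughly white")
```

The regression runs in both cases; only the residual check reveals that the second fit leaves serial correlation untreated.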
I think that statistical predictions always tend to the mean as time increases. If we use the maximum entropy principle to obtain these predictions, the result depends on the time scale of entropy maximization. For instance, if the entropy maximization is done on the observation time scale, then the prediction may be equivalent to a prediction obtained by a Markov model. However, other settings of entropy maximization (on several time scales) result in long range dependence (as I have demonstrated in my 2005 paper “Uncertainty, entropy, scaling and hydrological stochastics, 2, Time dependence …” in Hydrological Sciences Journal). In the latter case statistical predictions may tend to the mean much more slowly than in the Markovian case and their confidence intervals would be much wider. Also, in Monte Carlo realizations, the excursions from the mean will be longer and wider.
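For the Markovian case only, the tendency to the mean can be written down directly. The sketch below assumes a stationary AR(1) process with lag-one correlation rho and marginal standard deviation sd; it does not cover the long range dependence case, and the parameter values are hypothetical.

```python
def ar1_forecast(x_now, mean, rho, sd, horizon):
    """Point forecast and forecast-error sd for a stationary AR(1) process.

    Assumes lag-one autocorrelation rho (|rho| < 1) and marginal
    standard deviation sd.
    """
    point = mean + rho ** horizon * (x_now - mean)
    err_sd = sd * (1.0 - rho ** (2 * horizon)) ** 0.5
    return point, err_sd

if __name__ == "__main__":
    # Start three standard deviations above the mean and look ahead.
    for h in (1, 2, 5, 10, 50):
        print(h, ar1_forecast(3.0, 0.0, 0.7, 1.0, h))
```

The point forecast decays geometrically toward the mean while the forecast-error standard deviation grows toward the marginal one, which is the sense in which the prediction eventually carries no more information than the mean.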
One of the main inputs into a niche model is the environmental variables. Optimizing the choice of variables is important for many reasons, primarily interpretation and subsequent accuracy on independent test data.
In almost all cases to date, annual climate averages have been used in modeling species distributions. Where models have been developed and annual averages of climate compared with monthly variables and others such as vegetation, improvements in the accuracy were attributed to the monthly climate data sets (i.e. greater temporal resolution).
The post Bayesian Networks introduced this useful and flexible form of modeling. Here is an example of a Bayesian Belief Net or BBN model of a simple three variable species prediction system.
The previous post “Writing a Book Using R” described using LaTeX for writing a book, saving time with one master bibliography and other organizational devices. Sweave allows R code to be included in a LaTeX file. This is a good marriage: LaTeX provides typeset text, while R is oriented toward statistics and graphics.
Long-range dependence is being identified in many disciplines, such as networking, databases, economics, climate and biodiversity. LTP is competing with the sexy “long tail” for top spot as a theory of cultural consumption. Thus, there is a pressing need for software offering complete long-range dependence analysis.
A number of posts here and here have compared the “hockey stick” construction of past temperatures to the play by Rolin Jones to illustrate an area of science where dramatization and self-promotion have become confused with the search for scientific truth. The background of this story is fascinating.
One of the best, and possibly the only, guides to advanced use of R is the manual “Econometrics in R” by Grant V. Farnsworth. Dated June 26, 2006, it was originally written as part of a teaching assistantship and as a personal reference. Some of the topics covered I have found nowhere else. The manual is particularly thorough in its treatment of regression, i.e.:
We use the data from CRU, read into R using the code in the post R Code to Read CRU Data. The initial approach to testing whether the global temperatures from CRU have a unit root is to run a Dickey-Fuller test.
The augmented Dickey-Fuller test checks whether a series has a unit root. The default null hypothesis is that the series does have a unit root.
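To make the mechanics concrete, here is a minimal sketch of the plain (non-augmented) Dickey-Fuller regression in Python, applied to simulated data rather than the CRU series. It regresses the first difference on the lagged level with an intercept and reports the t-statistic of the lag coefficient. The critical value of about -2.86 is the approximate large-sample 5% point for the intercept-only case; the simulated series and their parameters are assumptions for illustration.

```python
import math
import random

DF_CRITICAL_5PCT = -2.86  # approximate, intercept and no trend, large samples

def dickey_fuller_t(series):
    """t-statistic of gamma in: diff(y)_t = alpha + gamma * y_{t-1} + error."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(diffs)
    mx, my = sum(lagged) / n, sum(diffs) / n
    sxx = sum((x - mx) ** 2 for x in lagged)
    sxy = sum((x - mx) * (d - my) for x, d in zip(lagged, diffs))
    gamma = sxy / sxx
    alpha = my - gamma * mx
    rss = sum((d - alpha - gamma * x) ** 2 for x, d in zip(lagged, diffs))
    se_gamma = math.sqrt(rss / (n - 2) / sxx)
    return gamma / se_gamma

def simulate_ar1(phi, n=500, seed=11):
    """phi = 1 gives a random walk (unit root); |phi| < 1 is stationary."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y

if __name__ == "__main__":
    for phi in (1.0, 0.5):
        t = dickey_fuller_t(simulate_ar1(phi))
        verdict = "reject unit root" if t < DF_CRITICAL_5PCT else "cannot reject unit root"
        print(phi, t, verdict)
```

Note that the statistic is compared with Dickey-Fuller critical values, not the usual t tables, because under the null hypothesis of a unit root its distribution is non-standard. The augmented version adds lagged differences to the regression to absorb serial correlation.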
According to Wikipedia, the mathematical model for Brownian motion (also known as a random walk) can be used to describe many phenomena besides the random movements of minute particles, such as stock market fluctuations and the evolution of physical characteristics in the fossil record. The simple form of the mathematical model for Brownian motion is:
S_t = S_{t-1} + e
where e is drawn from a probability distribution. My initial implementation of Brownian motion in R and 2 dimensions is this:
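The R listing from the original post is not reproduced in this excerpt. As a stand-in, and not the author's code, here is a minimal Python sketch of a two-dimensional random walk. It assumes the additive form, in which each coordinate is its previous value plus an independent Gaussian step; the function name, step distribution and starting point are illustrative choices.

```python
import random

def random_walk_2d(n_steps, step_sd=1.0, seed=None):
    """Return a list of (x, y) positions, starting at the origin."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    path = [(x, y)]
    for _ in range(n_steps):
        # Each coordinate moves by an independent random increment e.
        x += rng.gauss(0.0, step_sd)
        y += rng.gauss(0.0, step_sd)
        path.append((x, y))
    return path

if __name__ == "__main__":
    walk = random_walk_2d(1000, seed=42)
    print(walk[:3], walk[-1])
```

Plotting the x and y columns of the returned path against each other gives the familiar wandering trace.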
Reading CRU data is an opportunity to demonstrate some of the features available for programming in R. The Climate Research Unit (CRU) data is a record of the global, northern and southern hemisphere temperatures compiled from temperature sources around the globe for the last 150 years. The files are located at http://www.cru.uea.ac.uk:80/cru/data/temperature/ and look like this, with alternating lines of numbers and values for each month, and annual averages at the end:
The Washington Post has finally commented on the Wegman Report, and Whitfield hearings I and II on the so-called “hockey stick” graph — a trend line that purports to show little temperature variation throughout the Medieval Warm Period and a sudden and dramatic increase in global temperatures in the 1990s and therefore looks like a hockey stick. Their position:
The graph is hardly central to the modern debate over climate change. Yet the subcommittee has investigated the scientists who dared produce it and hounded them for information.
This despite the graph being in the summary for policy makers in the IPCC 2001, and used by dozens of major government agencies throughout the world to motivate global warming programs. And their spin on scientists being asked to justify their results — anyone would think we were back in the Medieval Warm Period and the hockey stick was the equivalent of the Ptolemaic system with the Earth the center of the universe.
So what is it all about? A good starting point is the statement here: Some Thoughts on Disclosure and Due Diligence in Climate Science. Subsequent to these efforts, the Wegman report uncovered a number of fictions in an area of climate science, and offered a number of constructive solutions to the pervasive problems they discovered.
So that you can decide whether this is important or not, below are compiled some historic statements by climate scientist Michael Mann and others regarding their science, together with relevant comments on the Mann et al. study from the Wegman Report and others.
On the Medieval Warm Period
While warmth early in the millennium approaches mean 20th century levels, the late 20th century still appears anomalous: the 1990s are likely the warmest decade, and 1998 the warmest year, in at least a millennium.
Very little confidence can be placed in statements about average global surface temperatures prior to A.D. 900 because the proxy data for that time frame are sparse.
A “simple” regression model is simple because it has a single independent variable instead of multiple independent variables. Because “simple” is in the name, many people make the mistake of thinking such models are simple to use. One common mistake is to apply them to data without first checking whether the assumptions are met. Here is a useful web page from Duke University called Not-so-simple Regression Models describing a general approach to developing simple linear regression models.
Emotional Intelligence or EI is a concept popularized by Daniel Goleman as a complement to competence measures like IQ in the emotional sphere. But EI has the problem that it is not quantitatively defined with a number and standards like IQ. So it has been criticized by people like Eysenck:
“exemplifies more clearly than most the fundamental absurdity of the tendency to class almost any type of behavior as an ‘intelligence’.”
Statistics are the quintessential antiemotional tollgate.
The “Little Handbook of Statistical Practice” is one of the deepest and best guides to statistics I have seen. Here Gerard Dallal asks “Is Statistics Hard?”.
Statistics is not so much hard as counterintuitive: backwards, convoluted and shades of grey.
The previous post “Random Numbers Predict Future Temperatures” used random numbers for prediction of climate. Random numbers may also be predicted. This is a major difference between models and natural phenomena. Random numbers generated by computer can always be predicted exactly given knowledge of the code, and so have a deterministic generating mechanism, or model.
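A minimal sketch of this point: a linear congruential generator whose next output can be predicted exactly by anyone who knows the code and the current state. The multiplier and increment below are a commonly published parameter set for a 32-bit generator; the class and function names are made up for the example, and this is not the generator used in the earlier post.

```python
import random

MODULUS = 2 ** 32
MULTIPLIER = 1664525
INCREMENT = 1013904223

class LCG:
    """Linear congruential generator: state -> (a * state + c) mod m."""

    def __init__(self, seed):
        self.state = seed % MODULUS

    def next(self):
        self.state = (MULTIPLIER * self.state + INCREMENT) % MODULUS
        return self.state / MODULUS  # a "random" number in [0, 1)

def predict_next(state):
    """Predict the generator's next output from its current state."""
    return ((MULTIPLIER * state + INCREMENT) % MODULUS) / MODULUS

if __name__ == "__main__":
    gen = LCG(seed=2006)
    for _ in range(3):
        guess = predict_next(gen.state)
        actual = gen.next()
        print(guess, actual, guess == actual)

    # The same holds for library generators: equal seeds, equal sequences.
    a, b = random.Random(42), random.Random(42)
    print([a.random() for _ in range(3)] == [b.random() for _ in range(3)])
```

The numbers pass for random only to an observer who lacks the generating mechanism, which is exactly the knowledge we do not have for natural phenomena.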
If a picture is worth a thousand words, a video is worth more. The use of compelling media has been undergoing something of a revolution recently, driven by new social sites like YouTube. I like Salsa music, and found this clip of the Colombian band Guayacan (I think I saw these guys in Mexico City in 1997 – amazing act). If that’s a little too strong for your taste, here is another Colombian Salsa band Grupo Niche. The YouTube science category has some cool robot clips.