
Some statistical tools offer methods of automatic identification of the distribution of data, such as the one shown in this post.

Generally, is this approach reliable? If not, what's a better alternative to the problem of working with data from unknown distributions?

ivanmp
  • I'm fairly new to statistics (been studying for a couple of months), so I don't have a good notion of how things are -- or should be -- done in the real world. – ivanmp Aug 18 '14 at 17:42

2 Answers


The problem with this approach in practice is that most data sets are small enough that many distributions will adequately fit the data. If you arbitrarily pick a distribution that happens to fit the data and then proceed to do calculations or a simulation under this assumption, you can be badly misled.
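
As a quick illustration of the point (a minimal sketch in Python/scipy; the sample and the candidate families here are invented for the example, not taken from the original post): fit several quite different families to a small sample and check each fit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=30)   # small "unknown" sample; truth is a gamma

# Fit several families by maximum likelihood and run a KS goodness-of-fit check.
# (With parameters estimated from the same data the p-values are only indicative,
# but that is enough to make the point.)
for name, dist in [("gamma", stats.gamma), ("lognorm", stats.lognorm),
                   ("weibull_min", stats.weibull_min), ("norm", stats.norm)]:
    params = dist.fit(x)
    pval = stats.kstest(x, dist.cdf, args=params).pvalue
    print(f"{name:12s} KS p-value = {pval:.2f}")
# Typically none of the families is rejected: at n = 30 the data cannot tell them apart.
```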

This problem occurs frequently in discrete event simulation modeling. One practical approach is to run the simulation model using a variety of distributions to see whether the results are sensitive to the distributional assumptions.
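
For example (a minimal sketch of such a sensitivity check for a hypothetical single-server queue; the model and parameter values are assumptions of mine, not from the answer): run the same simulation with service-time distributions that share a mean but differ in shape.

```python
import numpy as np

rng = np.random.default_rng(7)

def mean_wait(service_sampler, n=200_000, arrival_rate=0.8):
    """Average waiting time in a FIFO single-server queue (Lindley recursion)."""
    inter_arrivals = rng.exponential(1.0 / arrival_rate, size=n)
    services = service_sampler(n)
    wait = total = 0.0
    for a, s in zip(inter_arrivals, services):
        wait = max(0.0, wait + s - a)   # waiting time of the next customer
        total += wait
    return total / n

# Three service-time distributions with the same mean (1.0) but different shapes.
samplers = {
    "constant":    lambda n: np.full(n, 1.0),
    "exponential": lambda n: rng.exponential(1.0, size=n),
    "lognormal":   lambda n: rng.lognormal(mean=-0.5, sigma=1.0, size=n),
}
for name, sampler in samplers.items():
    print(f"{name:11s} mean wait ~ {mean_wait(sampler):.2f}")
# The average wait differs substantially across the three even though the mean
# service time is identical -- the output is sensitive to the assumed distribution.
```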

If you're doing statistical analysis, then nonparametric statistics can often be used to analyze your data without making distributional assumptions.
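
For instance (a minimal sketch with made-up data, not from the answer), a rank-based test and a bootstrap interval compare two groups without assuming any distributional family:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.lognormal(0.0, 1.0, size=40)   # two skewed samples whose family we
b = rng.lognormal(0.4, 1.0, size=40)   # pretend not to know

# Rank-based comparison of the two groups -- no distributional family assumed.
res = stats.mannwhitneyu(a, b, alternative="two-sided")
print(f"Mann-Whitney U p-value: {res.pvalue:.3f}")

# Bootstrap percentile interval for the difference in medians.
diffs = [np.median(rng.choice(b, b.size)) - np.median(rng.choice(a, a.size))
         for _ in range(5000)]
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for median(b) - median(a): ({lo:.2f}, {hi:.2f})")
```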

Brian Borchers

Identifying the distribution of data is essentially impossible.

The class of distribution functions is very large; its cardinality must be at least that of $\mathbb{R}$ (e.g. consider just the unit step functions, each corresponding to a distribution that puts all its probability at a single point $x$ - there are as many of those as there are real numbers, so the class must be at least that large).

Further, any cdf has infinitely many "near neighbors" that, at a given sample size, are hard to tell apart from it (e.g. if we use the KS statistic to tell them apart, there are infinitely many distributions close enough to the true one that a test at that sample size won't be able to detect the difference).
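
A small simulation makes this concrete (an illustrative sketch; the particular choice of $t_{20}$ versus the standard normal is mine): at $n = 100$, a KS test of data drawn from $t_{20}$ against $N(0,1)$ rejects at roughly the nominal rate, i.e. it essentially cannot see the difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n, rejections = 1000, 100, 0
for _ in range(n_sims):
    x = rng.standard_t(df=20, size=n)            # truth: t with 20 df
    if stats.kstest(x, "norm").pvalue < 0.05:    # H0: standard normal
        rejections += 1
print(f"rejection rate: {rejections / n_sims:.3f}")   # typically close to 0.05
```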

So the idea that we can identify "the" distribution on the basis of a sample is a hopeless one.

If we restrict ourselves to some small list of candidates, then at some large sample size we might hope to rule out almost all the list (which sounds useful) ... but then we may actually end up ruling out the entire list (and indeed as sample sizes become large, this becomes essentially a certainty, because the chances our list includes the actual distribution of the data will be essentially zero).
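
As a rough illustration of that last point (a sketch with an invented "true" distribution, not from the answer): with a large sample from a mixture that belongs to none of the candidate families, every family on the list gets rejected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 50_000
# "Real" data: a two-component lognormal mixture, not itself in any standard family.
x = np.where(rng.random(n) < 0.7,
             rng.lognormal(0.0, 0.5, size=n),
             rng.lognormal(1.5, 0.3, size=n))

for name, dist in [("gamma", stats.gamma), ("lognorm", stats.lognorm),
                   ("weibull_min", stats.weibull_min), ("norm", stats.norm)]:
    params = dist.fit(x)
    pval = stats.kstest(x, dist.cdf, args=params).pvalue
    print(f"{name:12s} KS p-value = {pval:.1e}")   # all essentially zero
```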

[Further, tools which have a very large list of distribution families to choose from often "overfit", which can be counterproductive. One hopes that they might eventually catch up with some of the ideas which help avoid this problem, but even then, in general domain knowledge is going to be a better tool for good model choice than some arbitrary list of distributions.]

Indeed the entire approach seems pointless: not only are real distributions typically going to be more complex than we can ever hope to identify (e.g. we might conceive of them as consisting of arbitrary mixtures), but knowing the true distribution class would be effectively useless as a model (it might have more parameters than we have observations, for example).

Probability models are in general little more than (hopefully) useful approximations. We should treat them as such.

So our interest should not be in identifying *the* distribution, but in finding *a* distribution - one which describes the situation well enough for our purposes, and which is simple enough for us to do something with at the sample sizes we can actually get.

Glen_b
  • Nice answer! I'm still a little confused, though. In the end you say that we should only be looking for an approximation of the real distribution. So it seems a bit contradictory to me. The way I interpreted the answer is that tools such as the one I mention **could** be of help, if after all you're just looking for _a distribution_, or an approximation of the real distribution. – ivanmp Aug 19 '14 at 12:06
  • 2
    I was suggesting that "automatic identification" of a specific distributional form in exactly the sense of your link is usually a bad idea. I even explain why in my answer. Explained another way, it's a bad idea for pretty much the same reason that in regression throwing a hundred sets of three predictors at a response and choosing the most significant set of three is usually a bad idea. – Glen_b Aug 19 '14 at 12:12
  • Oh, I see. I just wanted to confirm my interpretation of your answer. – ivanmp Aug 19 '14 at 12:15
  • To expand on that -- in the regression analogy I mentioned -- such an approach *could* be of help there, too. But on average, it will lead to several problems - p-values that are "too low" (things that seem more significant than they are), standard errors that are too small, and poor out-of-sample performance (see the sketch below the comments). – Glen_b Aug 19 '14 at 12:26
  • 1
    Cross validation might be one approach to dealing with these kinds of issues. – Glen_b Aug 19 '14 at 12:32
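
(To make the regression analogy in the comments concrete -- a throwaway sketch, not from the thread, with all numbers invented: the response below is pure noise, yet the "best" predictors picked out of 100 look significant.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, k = 50, 100
X = rng.normal(size=(n, k))   # 100 candidate predictors
y = rng.normal(size=n)        # response: pure noise, unrelated to every predictor

# Marginal p-value of each predictor, then "select" the three most significant.
pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(k)])
best = np.argsort(pvals)[:3]
print("selected predictors:", best)
print("their p-values:     ", np.round(pvals[best], 3))
# The smallest of 100 null p-values is usually well below 0.05, so the selected
# predictors look "significant" even though there is nothing there to find.
```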