I find maximum entropy (ME) interesting, but I am puzzled about when, in the real world, it should actually be invoked. My concern is that the utility of ME is exaggerated, though I would be extremely happy to have this concern allayed.

Let me elaborate on my thinking. Suppose I have a set of data. From these I might measure the variance, and with that number in hand I could maximize entropy subject to the constraint that the result match the variance I have measured. I would then obtain the normal distribution. But on this basis, am I really supposed to believe that the normal distribution is an appropriate description of my data? I think not. Had I chosen to measure some other quality of the data, say kurtosis or mean absolute deviation, then invoking ME would yield a different distribution. Same data, same process generating the data, but now, based solely on what I have chosen to measure, a different distribution seems to be implied.

To me, this is not how analysis usually works. If I have data, I can measure the variance, sure, but I would probably want to invoke the central limit theorem to justify use of the normal, not ME. So I'm left wondering: in a practical world, with data already in hand, under what circumstances would one actually invoke the very attractive mathematics of the maximum entropy probability distribution?
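To make the concern concrete, here is a small numerical sketch (my own illustration; the simulated sample is a hypothetical stand-in for "data in hand"). The same sample, summarized by two different statistics, implies two different maximum entropy distributions: fixing the mean and variance gives a normal, while fixing the mean absolute deviation about the mean gives a Laplace.

    import numpy as np
    from scipy import stats

    # Hypothetical stand-in for "data in hand"; the true source is treated as unknown.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1000)

    mu = x.mean()
    sigma = x.std()                  # summary 1: standard deviation (variance constraint)
    mad = np.mean(np.abs(x - mu))    # summary 2: mean absolute deviation

    # Maximum entropy densities implied by each constraint
    # (these closed forms are standard results, not numerical fits):
    maxent_var = stats.norm(loc=mu, scale=sigma)    # ME given mean and variance
    maxent_mad = stats.laplace(loc=mu, scale=mad)   # ME given mean absolute deviation

    # Same data, two different implied densities:
    grid = np.linspace(-3, 3, 7)
    print(np.round(maxent_var.pdf(grid), 3))
    print(np.round(maxent_mad.pdf(grid), 3))

Nothing about the data changes between the two fits; only my choice of summary statistic does, and that choice alone determines which distribution ME hands back.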
-
If you use the data to construct your prior, it is no longer a prior... – Xi'an Jul 18 '15 at 17:30
-
Presumably the mean and variance would be all you know. Of course, if you have more information you'd no longer treat things as though you only knew a mean and a variance. This is somewhat like saying that before I flip a coin I would treat it as a Bernoulli random variable, but after I see that it's heads I wouldn't. So why would you treat it as Bernoulli distributed at all? – dsaxton Jul 18 '15 at 17:40
-
Agreed. Perhaps I'm missing the point, then (I'm learning here). From my perspective, the CLT can motivate a model distribution, the normal. And this is not a "prior distribution" but rather a "conditional distribution" N(x|mu,sigma), conditional on mu and sigma being parameters. I could hypothesize that the process giving rise to my data is like a process for which the CLT applies. I can't, however, imagine the circumstance under which I could hypothesize that the process giving rise to my data is like a process for which ME applies. An example would certainly be helpful! – Isambard Kingdom Jul 18 '15 at 18:05
-
I think it's partly a question of philosophy. When you have data arising from an unknown source and want to postulate a parametric model, there is no rigorous way of going about it. So a "solution" is to use some heuristic that seems reasonable, like assuming a model which reflects maximum ignorance about the underlying data-generating process. This is basically the concept behind maximum entropy. I can't really think of a practical example that would show why this makes sense; I guess it's a matter of intuition. – dsaxton Jul 18 '15 at 20:03
-
dsaxton, yes, and it seems to me that the problem with justifying the use of ME is even deeper. One needs to assume a heuristic and, also, assume that some other quantity, like the mean or variance, is actually given. Not just calculable from the data, but rather given or known beforehand. To me that is very unrealistic. When would one have any inkling of such a quantity before one even collects any data? So, again, I'm comfortable assuming a *process* that gives rise to a distribution, but I'm not comfortable invoking ME + constraint to suppose a distribution. – Isambard Kingdom Jul 18 '15 at 20:17
-
I had the same thought, and I completely agree. It is an interesting property of the normal distribution that it is the ME distribution for given mean and variance, but there are not so many situations in which mean and variance actually constitute the given information. The only situation I can think of is that you read an old paper which reported this information based on some data, but the data are no longer accessible. Hardly a common paradigm. – A. Donda Jul 18 '15 at 23:22
-
And if you have the data, there is no reason to throw away information and reduce the data to these two parameters. And as you write, you could just as easily and arbitrarily choose some other set of parameters that give an ME distribution. – A. Donda Jul 18 '15 at 23:23
-
@Xi'an, I don't understand the comment, since the question does not refer to prior distributions at all. – A. Donda Jul 18 '15 at 23:23
-
Donda, yes, you see the situation as I do. – Isambard Kingdom Jul 18 '15 at 23:27
-
Very interesting question! – Vladislavs Dovgalecs Sep 06 '15 at 04:29
-
My answer here: http://stats.stackexchange.com/questions/66186/statistical-interpretation-of-maximum-entropy-distribution/245198#245198 might be of interest. – kjetil b halvorsen Jan 02 '17 at 18:28
1 Answer
First of all, I tried to comment on the question, but I couldn't because I don't yet have 50 reputation, so I'm posting my opinion as an answer, even though I know it is not a complete answer to what is asked.
In a Bayesian probabilistic framework, probabilities are considered "degrees of belief", that is, the most rational measure of the plausibility of assertions in situations of incomplete information (1).
In that context, one can show (2) that the Principle of Maximum Entropy is the reasonable way to update a rational agent's "state of knowledge" to a new state when confronted with new information that constrains that knowledge (for example, more data). So that's Maximum Entropy as I see it.
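As a concrete illustration (my own sketch, not drawn from references (1) or (2)), consider Jaynes' dice example: you are told only that a die's long-run average is 4.5 rather than the fair value 3.5, and you want the least committal distribution over the six faces consistent with that constraint. Maximizing entropy subject to the mean constraint yields p_i proportional to exp(lambda * i), with the multiplier lambda chosen to match the stated mean:

    import numpy as np
    from scipy.optimize import brentq

    faces = np.arange(1, 7)
    target_mean = 4.5   # the only information we are given

    def mean_under(lam):
        # The maxent solution has the form p_i proportional to exp(lam * i);
        # return the mean face value under that distribution.
        w = np.exp(lam * faces)
        p = w / w.sum()
        return p @ faces

    # Solve for the Lagrange multiplier that satisfies the constraint.
    lam = brentq(lambda l: mean_under(l) - target_mean, -5.0, 5.0)

    w = np.exp(lam * faces)
    p = w / w.sum()
    print(np.round(p, 4))       # probabilities tilted toward the high faces
    print(round(p @ faces, 4))  # recovers 4.5

The resulting distribution encodes nothing beyond the possibility space and the stated constraint; if new information arrives (say, actual roll frequencies), the state of knowledge is updated again in the same way.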
Your question, however, concerns priors, which are a much more delicate matter. That is more or less the terrain of philosophical, epistemological questions: some will argue that priors are nonsense, because science is remarkably certain and cannot allow room for subjectivity; on the other side, others will argue that inference is genuinely subjective, and there is nothing wrong with that, because we will never have access to the "true nature" of our object of study.
I prefer to justify Bayesian subjectivity and priors by saying that we normally won't have access to an objective, mechanistic description of our event, or maybe we do, but we don't want it (because of the many degrees of freedom, limited computational power, et cetera). In these scenarios, we take advantage of every piece of key information that we have about the problem (not the data!): be it symmetries, moments, or whatever else, so that we can codify our knowledge as probabilities and update it on the basis of more data.
On behalf of Bayesian and entropic inference, I shall try to explain the reasoning behind these often misunderstood topics by quoting Bertrand Russell:
I wish to propose for the reader’s favourable consideration a doctrine which may, I fear, appear wildly paradoxical and subversive. The doctrine in question is this: that it is undesirable to believe in a proposition when there is no ground whatever for supposing it true.
Bertrand Russell, in Sceptical Essays
The main point is that we can't use information that we don't have or ignore information that we do have when constructing our priors. All knowledge must be considered.
I must apologize for my English; I'm not a native speaker, and sometimes (many times) I write things thinking they're right when they are not. I'd really appreciate it if someone pointed out my mistakes.
