
I was reading the Bayesian network article on Wikipedia:

https://en.wikipedia.org/wiki/Bayesian_network

It mentions the use of the "Principle_of_maximum_entropy" for parameter learning.

And according to the Principle_of_maximum_entropy article on Wikipedia:

https://en.wikipedia.org/wiki/Principle_of_maximum_entropy

"By choosing to use the distribution with the maximum entropy allowed by our information, the argument goes, we are choosing the most uninformative distribution possible. To choose a distribution with lower entropy would be to assume information we do not possess. Thus the maximum entropy distribution is the only reasonable distribution"

It seems counterintuitive that we want to maximize entropy, which gives the "most uninformative distribution" possible. Can somebody help me understand what I am missing here?

muni

1 Answer


You would like to find the most uninformative distribution possible that matches your data (or some aspects of your data). This way, you are making the fewest assumptions about the nature of the random process generating the data.

Consider that the most informative one is probably going to overfit the data.

Here is an example: Suppose I have iid data $(X_1,X_2,\dots,X_n) = (0,1,0,1,1,2,1,0,0)$ from a distribution on a 3-letter alphabet $\{0,1,2\}$. Let $p_i$ be the probability of observing $i=0,1,2$ from this distribution. Suppose that I want to pick a distribution (equivalently a vector $(p_0,p_1,p_2)$) whose mean matches the observed sample mean $\bar x = \frac1n \sum_{i=1}^n X_i$.

The maximum entropy principle solves the following problem $$ \max_{p} \; -\sum_{i=0}^{2} p_i \log p_i, \quad \text{subject to}\quad 0\,p_0 + 1\,p_1 + 2\,p_2 = \bar x $$ and that $p$ is a probability vector: $\sum_i p_i = 1$ and $p_i \ge 0$ for $i=0,1,2$. Entropy is a concave function, and maximizing it over a convex region (usually) leads to some $p$ in the interior of the region. On the other hand, minimizing the entropy, a concave function, gives you one of the extreme points, i.e., one of $(1-\bar x,\bar x,0)$ or $(1-\frac12 \bar x,0,\frac12\bar x)$, which are less random distributions, putting all the mass on two of the three letters. (I will leave the details to you to figure out. You can eliminate $p_0$ and plot the feasible region in the $p_1$-$p_2$ plane.) Which of these two extreme points minimizes the entropy is determined by which of $\bar x$ or $\bar x/2$ is farther from $1/2$. For example, if $\bar x = 0.6$, then $\bar x/2 = 0.3$ is farther from $1/2$ than $\bar x$, hence the minimum-entropy solution is $(0.7,0,0.3)$, i.e., you put all the mass on $\{0,2\}$ and none on the letter $1$, even if you have observed many $1$s in your sample. (For the particular sample above, $\bar x = 2/3$ and the two extreme points happen to have equal entropy.)
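As a quick numerical check (a minimal sketch of my own, not part of the original answer), the constrained problem can be solved with SciPy; the array names and the use of `scipy.optimize.minimize` with equality constraints are my choices here:

```python
# Sketch: maximize entropy over (p0, p1, p2) subject to the mean constraint,
# then compare with the two extreme points named in the answer above.
import numpy as np
from scipy.optimize import minimize

x = np.array([0, 1, 0, 1, 1, 2, 1, 0, 0])   # the observed sample from the answer
xbar = x.mean()                              # sample mean, 6/9 = 2/3
letters = np.array([0.0, 1.0, 2.0])

def neg_entropy(p):
    # negative entropy; the small epsilon avoids log(0) on the boundary
    return float(np.sum(p * np.log(p + 1e-12)))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},            # probability vector
    {"type": "eq", "fun": lambda p: np.dot(letters, p) - xbar},  # mean matches xbar
]
res = minimize(neg_entropy, x0=np.full(3, 1/3),
               bounds=[(0.0, 1.0)] * 3, constraints=constraints)
print("max-entropy p:", np.round(res.x, 4))   # lands in the interior of the segment

# The two extreme points of the feasible region (the entropy-minimizing candidates):
for p in (np.array([1 - xbar, xbar, 0.0]),
          np.array([1 - xbar / 2, 0.0, xbar / 2])):
    q = p[p > 0]
    print(np.round(p, 4), "entropy =", round(float(-np.sum(q * np.log(q))), 4))
```

For this particular sample ($\bar x = 2/3$) the two extreme points print the same entropy; with $\bar x = 0.6$ instead, the second one, $(0.7, 0, 0.3)$, comes out strictly lower.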

Without any data constraint, i.e., removing $0p_0 + 1p_1 + 2p_2 = \bar x$, maximizing the entropy gives you the uniform distribution $(1/3,1/3,1/3)$, while minimizing it gives you one of the deterministic distributions $(1,0,0)$, $(0,1,0)$ or $(0,0,1)$. A uniform prior before you see any data is better than assuming a priori, with certainty, that one of the letters always comes up.
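The unconstrained claim can be verified with a short Lagrange-multiplier computation (a standard argument, added here for completeness rather than taken from the original answer): $$ \mathcal{L}(p,\lambda) = -\sum_{i=0}^{2} p_i \log p_i + \lambda\Big(\sum_{i=0}^{2} p_i - 1\Big), \qquad \frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \lambda = 0 \;\Longrightarrow\; p_i = e^{\lambda - 1}. $$ Every $p_i$ equals the same constant, so the normalization $\sum_i p_i = 1$ forces $p_i = 1/3$, the uniform distribution. Adding the mean constraint with its own multiplier gives $p_i \propto e^{\theta i}$ for some $\theta$, the usual exponential-family form of maximum-entropy solutions.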

passerby51
  • Well, the last part helped somewhat. Let me describe a situation: there is a binary target variable (1/0) with an event rate of 1%. So I assume that when we say maximize entropy, we are basically trying to find a point where it is extremely biased towards either class, thereby helping us reach the population of interest? – muni Dec 15 '16 at 07:58
  • @muni, I don't quite get what your example is. – passerby51 Dec 15 '16 at 19:49
  • The example is a dataset with a binary target (where the rate is 0.01) and some input variables. Now, will a Bayesian network try to maximize the entropy for the target variable by choosing the parameter which maximizes this rate, where 0.01 is the deterministic distribution? – muni Dec 16 '16 at 08:23