
The Kullback-Leibler divergence (or relative entropy) is a measure of how one probability distribution differs from a reference probability distribution. I want to know what connection it has to the maximum entropy principle, which says that (in the absence of other constraints) the uniform ($1/N$) distribution has maximum entropy.

If the reference distribution is the uniform distribution, and I minimize the KL divergence from some empirical data's probabilities to that reference, can this be viewed as an attempt to attain maximum entropy, in the sense that a KL divergence of 0 means the empirical distribution is identical to (does not diverge from) the target uniform distribution?
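A quick numerical sketch of this intuition (an illustrative example added here, not from the original post; natural logarithms and the discrete definitions of $H$ and KL are assumed): as $q$ moves away from the uniform reference $u = 1/N$, its entropy drops and its KL divergence to $u$ grows, and the divergence equals $\log N - H(q)$, reaching 0 exactly at the uniform distribution.

```python
import numpy as np

def entropy(q):
    """Shannon entropy H(q) = -sum q log q (natural log), with 0 log 0 := 0."""
    q = np.asarray(q, dtype=float)
    nz = q > 0
    return -np.sum(q[nz] * np.log(q[nz]))

def kl(q, u):
    """KL(q || u) = sum q log(q/u), summing only where q > 0."""
    q, u = np.asarray(q, dtype=float), np.asarray(u, dtype=float)
    nz = q > 0
    return np.sum(q[nz] * np.log(q[nz] / u[nz]))

N = 4
u = np.full(N, 1.0 / N)                    # uniform reference 1/N
for q in ([0.25, 0.25, 0.25, 0.25],        # uniform: H = log N, KL = 0
          [0.40, 0.30, 0.20, 0.10],        # less uniform: lower H, larger KL
          [0.85, 0.05, 0.05, 0.05]):       # far from uniform: lowest H, largest KL
    print(f"H(q) = {entropy(q):.4f}  KL(q||u) = {kl(q, u):.4f}  "
          f"log N - H(q) = {np.log(N) - entropy(q):.4f}")
```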

develarist
  • This is the maximum-relative-entropy principle. See for example https://doi.org/10.1007/BF01014906 or Csiszár: *An extended maximum entropy principle and a Bayesian justification*, or https://arxiv.org/abs/1706.02561 – pglpm Aug 05 '20 at 08:34
  • "the Kullback measure is a generalization of the Shannon measure, and that the Kullback measure has more reasonable additivity properties than does the Shannon measure. The results lend support to Jaynes's entropy maximization procedure." but this doesn't answer the question about minimizing KL when the reference distribution is uniform – develarist Aug 05 '20 at 08:52
  • "Thus $U_k$(ullback) reduces to $U_s$(hannon) in the special case of a constant prior distribution and a "zero or one" distribution of maximum information" (section 3) – pglpm Aug 05 '20 at 08:59
  • Where is that from? I have no access to those articles. – develarist Aug 05 '20 at 09:27
  • From the first reference in my first comment. Unfortunately I can't find a free version... – pglpm Aug 05 '20 at 09:39
  • @pglpm I couldn't find Csiszár: An extended maximum entropy principle and a Bayesian justification. Is this the original derivation of this maximum-relative-entropy principle? – Fred Guth May 13 '21 at 14:35

1 Answer


Yes. The model with maximum entropy has the minimum KL divergence to the uniform distribution: \begin{align*} \max_q H(q) &\equiv \min_q {\textit{KL}}(q \Vert u)\\ u(x) &= 1/N \;\;(\text{constant}) \end{align*}

Proof: with $u(x) = 1/N$ constant, \begin{align*} {\textit{KL}}(q \Vert u) &= \sum q \log \frac{q}{u}\\ &= \sum q \log q - \sum q \log u\\ &= -H(q) + \log N, \end{align*} and since $\log N$ does not depend on $q$, \begin{align*} \min_q {\textit{KL}}(q \Vert u) \;\equiv\; \min_q \big({-H(q)} + \log N\big) \;\equiv\; \max_q H(q). \end{align*}
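A numerical sanity check of this identity (a sketch added here, assuming `scipy` is available; `scipy.stats.entropy` returns the Shannon entropy when given one distribution and the KL divergence when given two):

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)
N = 5
u = np.full(N, 1.0 / N)                 # uniform reference

for _ in range(3):
    q = rng.dirichlet(np.ones(N))       # a random distribution on N outcomes
    lhs = entropy(q, u)                 # KL(q || u), natural log
    rhs = -entropy(q) + np.log(N)       # -H(q) + log N
    print(f"KL(q||u) = {lhs:.6f}   -H(q) + log N = {rhs:.6f}")
```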

Fun fact: I derived this for my master's dissertation. I didn't know whether it was important, or whether it had been used or derived before. I found this question while searching for a reference for this derivation. I found Giffin and Caticha 2007 (*Updating Probabilities with Data and Moments*), which goes much further than this. Probably there is another reference; if you find one, please let me know. :-)

Fred Guth
  • [Here is Csiszár's paper](https://www.uv.es/bernardo/BayesStatist2.pdf). But the "maximum-entropy-like" use goes farther back: [Jaynes 1963 §4.b](https://bayes.wustl.edu/etj/articles/brandeis.pdf), maybe even Rényi 1961: [*On measures of entropy and information*](https://projecteuclid.org/proceedings/berkeley-symposium-on-mathematical-statistics-and-probability/Proceedings-of-the-Fourth-Berkeley-Symposium-on-Mathematical-Statistics-and/Chapter/On-Measures-of-Entropy-and-Information/bsmsp/1200512181) and possibly earlier? – pglpm May 13 '21 at 18:18
  • Thanks a lot @pglpm – Fred Guth May 13 '21 at 18:55
  • I think this argument breaks down unless you ensure $q$ and $u$ have the same support; otherwise you violate the absolute-continuity requirement of the KL divergence, i.e. if $u(x)=0$ then it must be that $q(x)=0$. – adamconkey Feb 09 '22 at 04:03
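A minimal sketch of the support caveat raised in the last comment (an assumed example, not from the thread): if the reference $u$ assigns zero probability to an outcome on which $q$ puts mass, the divergence is infinite and the min-KL / max-entropy equivalence above no longer applies.

```python
import numpy as np

def kl(q, u):
    """KL(q || u) = sum q log(q/u); infinite if q puts mass where u = 0."""
    q, u = np.asarray(q, dtype=float), np.asarray(u, dtype=float)
    nz = q > 0
    if np.any(u[nz] == 0):            # absolute continuity violated
        return np.inf
    return np.sum(q[nz] * np.log(q[nz] / u[nz]))

u = np.array([0.5, 0.5, 0.0])         # "uniform" reference on only the first two outcomes
q = np.array([0.4, 0.4, 0.2])         # puts mass on the third outcome
print(kl(q, u))                       # inf: q is not absolutely continuous w.r.t. u

q2 = np.array([0.3, 0.7, 0.0])        # same support as u
print(kl(q2, u))                      # finite
```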