Based on the papers:
1. D. Erdogmus and J. C. Principe, "An Error-Entropy Minimization Algorithm for Supervised Training of Nonlinear Adaptive Systems," IEEE Transactions on Signal Processing, vol. 50, no. 7, pp. 1780–1786, 2002.
2. J. Principe, D. Xu, and J. Fisher, "Information theoretic learning," in Unsupervised Adaptive Filtering, S. Haykin, Ed. New York: Wiley, 2000, vol. I, pp. 265–319.
Entropy (Shannon's and Rényi's) has been used in learning by minimizing the entropy of the error as the objective function, instead of the mean squared error. The rationale is that minimizing the entropy of the error amounts to maximizing the mutual information between the system output and the desired response.
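To make the criterion concrete, here is a minimal sketch (my own illustration, not the authors' code) of minimum-error-entropy training of a linear model: Rényi's quadratic entropy of the error is estimated with a Gaussian Parzen window, and the weights follow the gradient of the information potential V(e) = (1/N²) Σᵢⱼ G(eᵢ − eⱼ), which is the same as descending H₂(e) = −log V(e). The function names, kernel width `sigma`, step size `lr`, and the toy data are all illustrative choices, not taken from the papers.

```python
# Minimal sketch of minimum-error-entropy (MEE) training of a linear model.
# Renyi's quadratic entropy of the error is estimated with a Gaussian Parzen
# window; the weights ascend the information potential V(e), i.e. descend H2(e).
import numpy as np

def renyi_quadratic_entropy(e, sigma):
    """Parzen estimate of Renyi's quadratic entropy H2(e) = -log V(e),
    with V(e) = (1/N^2) * sum_ij G_{sigma*sqrt(2)}(e_i - e_j)."""
    n = len(e)
    s2 = 2.0 * sigma ** 2                               # (sigma*sqrt(2))^2
    diff = e[:, None] - e[None, :]
    kernel = np.exp(-diff ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return -np.log(kernel.sum() / n ** 2)

def mee_fit_linear(x, d, sigma=1.0, lr=1.0, n_iter=300):
    """Fit y = x @ w by minimizing the entropy of the error e = d - x @ w,
    via gradient ascent on V(e) (equivalently, descent on H2(e))."""
    n, p = x.shape
    w = np.zeros(p)
    s2 = 2.0 * sigma ** 2
    diff_x = x[:, None, :] - x[None, :, :]              # x_i - x_j (constant)
    for _ in range(n_iter):
        e = d - x @ w
        diff_e = e[:, None] - e[None, :]                # e_i - e_j
        kernel = np.exp(-diff_e ** 2 / (2.0 * s2))
        # dV/dw is proportional to sum_ij G(e_i - e_j) (e_i - e_j) (x_i - x_j)
        grad_v = ((kernel * diff_e)[..., None] * diff_x).sum(axis=(0, 1)) / (s2 * n ** 2)
        w += lr * grad_v                                # ascend V == descend H2
    return w

# Toy usage: the MEE solution recovers w_true. Entropy ignores the error's mean,
# so any output bias would have to be fixed separately (e.g. by centering the error).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
d = x @ w_true + 0.1 * rng.normal(size=200)
w_hat = mee_fit_linear(x, d)
print("w_hat =", w_hat)
print("H2 before:", renyi_quadratic_entropy(d, 1.0),
      " after:", renyi_quadratic_entropy(d - x @ w_hat, 1.0))
```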
Now, entropy corresponds to disorder, i.e. uncertainty: the higher the uncertainty, the higher the entropy. Also, higher entropy means higher information content (this is what matters in compression), so a signal with high entropy cannot be compressed much.
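As a small numeric illustration of that point (my own toy example, not from the papers): the Shannon entropy of a discrete source, in bits, is the average number of bits an ideal code needs per symbol, so the most uncertain (uniform) source is the hardest to compress.

```python
# Shannon entropy (in bits) of a fair vs. a heavily biased 4-sided die.
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return -np.sum(p * np.log2(p))

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (maximal for 4 outcomes)
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.24 bits (nearly predictable)
```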
So, in view of the above, I have the following questions:

1. Is mutual information just another name for information gain?
2. If maximum entropy implies high information content, why do we minimize the entropy of the error between the output signal and the desired signal?
3. Is there a proof showing that minimizing the entropy of the error, when it is used as a fitness function, means we are getting close to the true estimate of the unknown parameter?