Based on the papers:
1. D. Erdogmus and J. C. Principe, "An Error-Entropy Minimization Algorithm for Supervised Training of Nonlinear Adaptive Systems," IEEE Transactions on Signal Processing, vol. 50, no. 7, pp. 1780–1786, 2002.
2. J. Principe, D. Xu, and J. Fisher, "Information theoretic learning," in Unsupervised Adaptive Filtering, S. Haykin, Ed. New York: Wiley, 2000, vol. I, pp. 265–319.
Entropy (Shannon's and Rényi's) has been used in learning by minimizing the entropy of the error as the objective function, instead of the mean squared error. The rationale is that minimizing the entropy of the error amounts to maximizing the mutual information between the system output and the desired response.
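To make the criterion concrete, here is a minimal sketch (my own illustration, not the authors' code) of minimum-error-entropy training of a linear model: Rényi's quadratic entropy of the error is estimated with a Gaussian Parzen window, and the weights follow the gradient of the information potential V(e) = (1/N²) Σᵢⱼ G(eᵢ − eⱼ), which is the same as descending H₂(e) = −log V(e). The function names, kernel width `sigma`, step size `lr`, and the toy data are all illustrative choices, not taken from the papers.

```python
# Minimal sketch of minimum-error-entropy (MEE) training of a linear model.
# Renyi's quadratic entropy of the error is estimated with a Gaussian Parzen
# window; the weights ascend the information potential V(e), i.e. descend H2(e).
import numpy as np

def renyi_quadratic_entropy(e, sigma):
    """Parzen estimate of Renyi's quadratic entropy H2(e) = -log V(e),
    with V(e) = (1/N^2) * sum_ij G_{sigma*sqrt(2)}(e_i - e_j)."""
    n = len(e)
    s2 = 2.0 * sigma ** 2                               # (sigma*sqrt(2))^2
    diff = e[:, None] - e[None, :]
    kernel = np.exp(-diff ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return -np.log(kernel.sum() / n ** 2)

def mee_fit_linear(x, d, sigma=1.0, lr=1.0, n_iter=300):
    """Fit y = x @ w by minimizing the entropy of the error e = d - x @ w,
    via gradient ascent on V(e) (equivalently, descent on H2(e))."""
    n, p = x.shape
    w = np.zeros(p)
    s2 = 2.0 * sigma ** 2
    diff_x = x[:, None, :] - x[None, :, :]              # x_i - x_j (constant)
    for _ in range(n_iter):
        e = d - x @ w
        diff_e = e[:, None] - e[None, :]                # e_i - e_j
        kernel = np.exp(-diff_e ** 2 / (2.0 * s2))
        # dV/dw is proportional to sum_ij G(e_i - e_j) (e_i - e_j) (x_i - x_j)
        grad_v = ((kernel * diff_e)[..., None] * diff_x).sum(axis=(0, 1)) / (s2 * n ** 2)
        w += lr * grad_v                                # ascend V == descend H2
    return w

# Toy usage: the MEE solution recovers w_true. Entropy ignores the error's mean,
# so any output bias would have to be fixed separately (e.g. by centering the error).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
d = x @ w_true + 0.1 * rng.normal(size=200)
w_hat = mee_fit_linear(x, d)
print("w_hat =", w_hat)
print("H2 before:", renyi_quadratic_entropy(d, 1.0),
      " after:", renyi_quadratic_entropy(d - x @ w_hat, 1.0))
```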
Now, entropy corresponds to disorder, i.e. uncertainty: the higher the uncertainty, the higher the entropy. Also, higher entropy means higher information content (this is what matters in compression), so a signal with high entropy cannot be compressed much.
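As a small numeric illustration of that point (my own toy example, not from the papers): the Shannon entropy of a discrete source, in bits, is the average number of bits an ideal code needs per symbol, so the most uncertain (uniform) source is the hardest to compress.

```python
# Shannon entropy (in bits) of a fair vs. a heavily biased 4-sided die.
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return -np.sum(p * np.log2(p))

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (maximal for 4 outcomes)
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.24 bits (nearly predictable)
```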
So, in view of the above, I have the following questions:

1. Is mutual information just another name for information gain?
2. If maximum entropy implies high information content, why do we minimize the entropy of the error between the output signal and the desired signal?
3. Is there a proof showing that minimizing the entropy of the error, when it is used as a fitness function, means we are getting close to the true estimate of the unknown parameter?