I am trying to understand the bias-variance trade-off in the context of non-parametric entropy estimation.
Specifically, using a histogram approach to estimate the entropy from a sample, we have:
$$\hat{H} = - \sum^{B}_{i=1}\hat{p}_iv_i\log(\hat{p}_i) $$
(for a generic partition into $B$ bins, where $\hat{p}_i$ is the estimated probability density in bin $i$ and $v_i$ is the volume of bin $i$).
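To make my setup concrete, this is roughly how I am computing $\hat{H}$ in practice (a minimal 1-D sketch; the function name `hist_entropy` and the use of `numpy.histogram` are just my own choices, not anything canonical):

```python
import numpy as np

def hist_entropy(x, B):
    """Histogram estimate of the (differential) entropy of a 1-D sample x, using B equal-width bins."""
    counts, edges = np.histogram(x, bins=B)
    v = np.diff(edges)                   # bin "volumes" (widths in 1-D)
    p_hat = counts / (counts.sum() * v)  # density estimate in each bin
    nonempty = counts > 0                # empty bins contribute nothing (0 * log 0 := 0)
    return -np.sum(p_hat[nonempty] * v[nonempty] * np.log(p_hat[nonempty]))
```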
I understand that bias is defined as $E[\hat{H}] - H$, where $H$ is the 'true' entropy, but in the general case one doesn't know the true population distribution (hence the non-parametric estimation). So I don't understand how the bias is calculated in general for this estimator, nor how it changes with the number of bins $B$.
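The only way I can see to pin the bias down concretely is to pick a distribution whose entropy is known in closed form and simulate. A rough sketch of that, assuming the `hist_entropy` helper above and a standard normal with true entropy $\tfrac{1}{2}\log(2\pi e)$:

```python
rng = np.random.default_rng(0)
n, reps = 1_000, 500
H_true = 0.5 * np.log(2 * np.pi * np.e)   # closed-form entropy of N(0, 1)

for B in (5, 20, 100, 500):
    estimates = [hist_entropy(rng.standard_normal(n), B) for _ in range(reps)]
    bias = np.mean(estimates) - H_true    # E[H_hat] approximated by the Monte Carlo mean
    print(f"B={B:4d}  bias ~ {bias:+.3f}")
```

But this only works because I know the true distribution, which is exactly what I don't have in the non-parametric setting.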
Secondly, the variance is given by $E[(\hat{H} - E[\hat{H}])^2]$, which avoids the above problem, but how does the expectation value $E[\hat{H}]$ even differ from the estimator $\hat{H}$?
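My current (possibly wrong) picture is that $E[\hat{H}]$ is an average over hypothetical repeated samples from the same population, while $\hat{H}$ is the single number I get from the one sample in front of me. Continuing the simulation sketch above (reusing `rng`, `n`, `reps` and `hist_entropy`), that would look like:

```python
B = 100
# each repetition is a fresh sample from the same population, so H_hat varies;
# the spread of these values across repetitions is the variance of the estimator
estimates = np.array([hist_entropy(rng.standard_normal(n), B) for _ in range(reps)])
print("H_hat from one sample:  ", estimates[0])
print("Monte Carlo E[H_hat]:   ", estimates.mean())
print("Monte Carlo Var[H_hat]: ", estimates.var())
```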
Finally, does it even make sense to consider variance outside the context of 'training' the histogram estimator? If there is just one sample and the goal is to get an estimate close to the true value, over-fitting doesn't feel like a concern, and one should simply aim for the minimum-bias parameterisation.
I think I am missing some really basic context here as these simple concepts are not making sense to me.