
Given a random variable $X$ taking values $x_i$ for $i=1,...,n$, with true distribution $p(x)$ and approximating distribution $q(x)$, the cross entropy is given by

$$H(p,q) = -\sum_{i=1}^np(x_i)\log q(x_i).$$

However, in practice the true distribution $p(x)$ is rarely if ever known, so instead we compute an "approximation" to the cross entropy, given by

$$-\sum_{i=1}^n\frac{1}{n}\log q(x_i).$$

Why is this a good approximation of the cross entropy?

I can't figure out why it would be. Even if we sample enough points from the true distribution to make $q(x)$ a really good approximation to $p(x)$, why would we want all the weights to be the same? In that case, wouldn't we want to just replace $\frac{1}{n}$ with $q(x_i)$? And if $q(x)$ is not known to be a good approximation, then setting $p(x_i)=\frac{1}{n}$ seems like a completely arbitrary choice.


1 Answer


Perhaps the notation misled you into confusing the support of $X$ with a data sample from $X$?

Since in most cases the letter $n$ is used to denote sample sizes, let's denote the support of $X$ as

$$X\in \{x_1,...,x_k,...,x_m\}$$

i.e. containing $m$ distinct values (since $X$ is discrete here). Then the cross-entropy is defined as

$$H(p,q) = -\sum_{k=1}^mp(x_k)\log_2 q(x_k) = -E_p\big(\log_2[q(X)]\big)$$

So far this is a theoretical quantity that involves all the possible values of $X$ and the two probability mass functions, without any relation whatsoever to actual realizations of $X$, i.e. without any connection to a data sample.
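For concreteness, here is a minimal numerical sketch of this theoretical quantity (Python with NumPy; the pmfs $p$ and $q$ over a three-point support are made up purely for illustration):

```python
import numpy as np

# Hypothetical pmfs over a support of m = 3 values (made-up numbers for illustration)
p = np.array([0.5, 0.3, 0.2])   # true distribution p(x_k)
q = np.array([0.4, 0.4, 0.2])   # approximating distribution q(x_k)

# Theoretical cross-entropy: H(p, q) = -sum_k p(x_k) * log2 q(x_k)
H = -np.sum(p * np.log2(q))
print(H)   # ~1.52 bits for these particular pmfs
```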

Assume now that we obtain a data sample of size $n$ of realizations of $X$; denote it $\{x_i, i=1,...,n\}$. Here the indexing of the realizations serves only to identify them; it has no connection to their values, and each possible value that $X$ can take may appear in the data sample more than once. Consider now the estimated cross-entropy

$$\hat H(p,q) = -\frac 1n\sum_{i=1}^n\log_2 q(x_i)$$
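As a quick illustration of this estimator (a sketch only, reusing the made-up $p$ and $q$ from above and drawing an i.i.d. sample from $p$), the sample average of $-\log_2 q(x_i)$ lands close to the theoretical $H(p,q)$ when $n$ is large:

```python
import numpy as np

rng = np.random.default_rng(0)
support = np.arange(3)           # the m possible values x_k
p = np.array([0.5, 0.3, 0.2])    # true pmf (made-up)
q = np.array([0.4, 0.4, 0.2])    # approximating pmf (made-up)

# Draw a sample of size n from p, then average -log2 q over the realizations
n = 100_000
sample = rng.choice(support, size=n, p=p)
H_hat = -np.mean(np.log2(q[sample]))

H_true = -np.sum(p * np.log2(q))
print(H_hat, H_true)   # the estimate should be close to the theoretical value
```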

Note that here the $x_i$ are the realizations of $X$, not its theoretically possible values $x_k$, and we are summing over the data sample, so the same numerical value may appear many times. Denote by $n_k$ the number of times the value $x_k$ appears in the sample. Then we can decompose the sum into

$$\hat H(p,q) = -\frac 1n\Big(n_1\log_2 q(x_1) +...+ n_m\log_2 q(x_m)\Big) $$

$$= -\Big( \frac {n_1}{n}\log_2 q(x_1) + ...+ \frac {n_m}{n}\log_2 q(x_m)\Big)$$

But $\frac {n_k}{n}=\hat p(x_k)$, the empirical relative frequency of $x_k$. So, reassembling but now summing over the values in the support,

$$\Rightarrow \hat H(p,q) = - \sum_{k=1}^m\hat p(x_k)\log_2 q(x_k)$$

which obviously is the empirical analogue of the theoretical relationship.
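To see this regrouping numerically (again a sketch with the same illustrative $p$ and $q$), the estimate computed per realization and the estimate computed per support value, weighted by the relative frequencies $n_k/n$, give the same number:

```python
import numpy as np

rng = np.random.default_rng(0)
support = np.arange(3)
p = np.array([0.5, 0.3, 0.2])    # true pmf (made-up)
q = np.array([0.4, 0.4, 0.2])    # approximating pmf (made-up)

n = 10_000
sample = rng.choice(support, size=n, p=p)

# Per-realization average over the sample ...
H_hat_sample = -np.mean(np.log2(q[sample]))

# ... equals the sum over the support weighted by relative frequencies n_k / n
counts = np.bincount(sample, minlength=len(support))
p_hat = counts / n                              # empirical pmf \hat p(x_k)
H_hat_support = -np.sum(p_hat * np.log2(q))

print(np.isclose(H_hat_sample, H_hat_support))  # True: same number, regrouped
```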

Alecos Papadopoulos
  • so given enough iid samples $x_i$, this is just using the law of large numbers to approximate $H(p,q)$? I think you're right, I was probably misled by the notation. – Thoth Aug 14 '14 at 03:02
  • To be exact, it is the application of the method of moments, which provides a consistent estimator for the cross-entropy. – Alecos Papadopoulos Aug 14 '14 at 03:05
  • By method of moments you mean expanding the logarithm as a power series and matching up each power of the random variable? I thought however that for any (suitably nice) function $g$, $g(\bar{X}) \rightarrow E[g(X)]$. – Thoth Aug 14 '14 at 03:09
  • No, the "method of moments" means taking a relation involving expected values of a random variable and using in its place its "sample analogue", namely the sample mean of realizations from this random variable (here, the sample mean of a _function_ of the realizations). – Alecos Papadopoulos Aug 14 '14 at 03:13
  • As for your additional comment, again, no: $g(\bar X) \rightarrow g(E(X))$ (by the continuous mapping theorem), and, by Jensen's Inequality, $g(E(X)) \neq E(g(X))$ except if $g()$ is linear. – Alecos Papadopoulos Aug 14 '14 at 03:15
  • Oops, I meant $\overline{g(X)}\rightarrow E[g(X)]$, is that not true either? – Thoth Aug 14 '14 at 03:18
  • This is true for ergodic processes in general, which is our case also, and so this is what happens here. Of course this is the LLN. I stressed the consistency property so that, if in addition $q(x) = \hat p(x)$ (the sample relative frequencies), it is easier to see that $\hat H \rightarrow H$. – Alecos Papadopoulos Aug 14 '14 at 03:27