12

I have occasionally seen the Radon-Nikodym derivative of one probability measure with respect to another, most notably in the Kullback-Leibler divergence, where it is the derivative of the probability measure of a model with some arbitrary parameter $\theta$ with respect to the measure under the true parameter $\theta_0$:

$$\frac {dP_\theta}{dP_{\theta_0}}$$

where both are probability measures on the space of datapoints, conditional on a parameter value: $P_\theta(D)=P(D\mid\theta)$.

What is the interpretation of such a Radon-Nikodym derivative in the Kullback-Leibler divergence, or more generally between two probability measures?

jld
  • 18,405
  • 2
  • 52
  • 65
user56834
  • 2,157
  • 13
  • 35

1 Answer

13

First, we don't need probability measures, just $\sigma$-finiteness. So let $\mathcal M = (\Omega, \mathscr F)$ be a measurable space and let $\mu$ and $\nu$ be $\sigma$-finite measures on $\mathcal M$.

The Radon-Nikodym theorem states that if $\mu(A) = 0 \implies \nu(A) = 0$ for all $A \in \mathscr F$, denoted by $\mu \gg \nu$, then there exists a non-negative measurable function $f$, unique up to $\mu$-null sets, such that $$ \nu(A) = \int_A f \,\text d\mu $$ for all $A \in \mathscr F$.

Here's how I like to think of this. First, for any two measures on $\mathcal M$, let's define $\mu \sim \nu$ to mean $\mu(A) = 0 \iff \nu(A) = 0$. This is a valid equivalence relation, and we say that $\mu$ and $\nu$ are equivalent in this case. Why is this a sensible equivalence for measures? Measures are just functions, but their domains are tricky to visualize. What if two ordinary functions $f, g :\mathbb R \to \mathbb R$ have this property, i.e. $f(x) = 0 \iff g(x) = 0$? Well, define $$ h(x) = \begin{cases} f(x) / g(x) & g(x) \neq 0 \\ \pi^e & \text{o.w.}\end{cases} $$ and note that anywhere on the support of $g$ we have $gh = f$, while outside of the support of $g$ we have $gh = 0 \cdot \pi^e = 0 = f$ (since $f$ and $g$ share supports), so $h$ lets us rescale $g$ into $f$. As @whuber points out, the key idea here is not that $0/0$ is somehow "safe" to do or ignore, but rather that when $g = 0$ it doesn't matter what $h$ does, so we can define it arbitrarily (say to be $\pi^e$, which has no special significance here) and things still work. Also in this case we can define the analogous function $h'$ with $g / f$ so that $fh' = g$.
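As a quick numerical sketch of this rescaling (the particular $f$ and $g$ below are made up; all that matters is that they share supports):

```python
import numpy as np

x = np.linspace(-3, 3, 601)
f = np.maximum(0.0, 1 - x**2)        # vanishes outside (-1, 1)
g = 2 * np.maximum(0.0, 1 - x**2)    # same support as f

# h = f/g where g != 0, arbitrary (pi^e, as in the text) elsewhere
h = np.full_like(x, np.pi ** np.e)
mask = g != 0
h[mask] = f[mask] / g[mask]

# gh = f everywhere: on the support because h = f/g there, and off the
# support because g = 0 kills whatever arbitrary value h takes
assert np.allclose(g * h, f)
```

The assertion passes regardless of what value we pick off the support, which is exactly the point: the arbitrary choice is invisible once multiplied by $g$.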

Next suppose that $g(x) = 0 \implies f(x) = 0$, but the other direction does not necessarily hold. This means that our previous definition of $h$ still works, but now $h'$ doesn't work since it'll have actual divisions by $0$. Thus we can rescale $g$ into $f$ via $gh = f$, but we can't go the other direction because we'd need to rescale something $0$ into something non-zero.
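The asymmetric case can be sketched the same way (again with made-up functions, here with $f$'s support strictly inside $g$'s):

```python
import numpy as np

x = np.linspace(-3, 3, 601)
g = np.maximum(0.0, 1 - (x / 2) ** 2)   # support (-2, 2)
f = np.maximum(0.0, 1 - x ** 2)         # support (-1, 1): g = 0 implies f = 0

h = np.ones_like(x)                      # arbitrary value off g's support
h[g != 0] = f[g != 0] / g[g != 0]
assert np.allclose(g * h, f)             # rescaling g into f still works

# But no h' can satisfy f*h' = g: wherever f = 0 and g != 0 we would
# need to rescale 0 into something non-zero
bad = (f == 0) & (g != 0)
assert bad.any()
```

So the one-directional hypothesis $g = 0 \implies f = 0$ buys exactly the one-directional rescaling.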

Now let's return to $\mu$ and $\nu$ and denote our RND by $f$. If $\mu \sim \nu$, then this intuitively means that one can be rescaled into the other, and vice versa. But generally we only want to go one direction with this (i.e. rescale a nice measure like the Lebesgue measure into a more abstract measure) so we only need $\mu \gg \nu$ to do useful things. This rescaling is the heart of the RND.

Returning to @whuber's point in the comments, there is an extra subtlety to why it is safe to ignore the issue of $0/0$. That's because with measures we're only ever defining things up to sets of measure $0$ so on any set $A$ with $\mu(A) = 0$ we can just make our RND take any value, say $1$. So it is not that $0/0$ is intrinsically safe but rather anywhere that we would have $0/0$ is a set of measure $0$ w.r.t. $\mu$ so we can just define our RND to be something nice there without affecting anything.

As an example, suppose $k \cdot \mu = \nu$ for some $k > 0$. Then $$ \nu(A) = \int_A \,\text d\nu = \int_A k \,\text d \mu $$ so we have that $f(x) = k = \frac{\text d\nu}{\text d\mu}$ is the RND (this can be justified more formally by the change-of-measure theorem). This is good because we have exactly recovered the scaling factor.
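For a finite space this is easy to check by hand (the masses below are made up; $\mu$ puts mass `mu_w[i]` on point $i$ and $\nu = k\mu$):

```python
import numpy as np

k = 2.5
mu_w = np.array([0.1, 0.4, 0.2, 0.3])   # masses of mu on 4 points
nu_w = k * mu_w                          # nu = k * mu

rnd = nu_w / mu_w                        # pointwise d(nu)/d(mu)
assert np.allclose(rnd, k)               # the RND is the constant k

# Integrating the RND against mu recovers nu on any set A
A = [0, 2]
assert np.isclose(np.sum(rnd[A] * mu_w[A]), nu_w[A].sum())
```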

Here's a second example to emphasize how changing RNDs on sets of measure $0$ doesn't affect them. Let $f(x) = \varphi(x) + 1_{\mathbb Q}(x)$, i.e. it's the standard normal PDF plus $1$ if the input is rational, and let $X$ be an RV with this density. This means $$ P(X \in A) = \int_A \left(\varphi + 1_{\mathbb Q}\right) \,\text d\lambda $$ $$ = \int_A \varphi \,\text d\lambda + \lambda\left(\mathbb Q \right) =\int_A \varphi \,\text d\lambda $$ so actually $X$ is still a standard Gaussian RV. Changing the density on $\mathbb Q$ has not affected the distribution in any way because $\mathbb Q$ is a set of measure $0$ w.r.t. $\lambda$.

As a final example, suppose $X \sim \text{Pois}(\eta)$ and $Y \sim \text{Bin}(n, p)$ and let $P_X$ and $P_Y$ be their respective distributions. Recall that a pmf is an RND with respect to the counting measure $c$, and since $c$ has the property that $c(A) = 0 \iff A = \emptyset$, it turns out that $$ \frac{\text dP_Y}{\text dP_X} = \frac{\text dP_Y / \text dc}{\text dP_X / \text dc} = \frac{f_Y}{f_X} $$

so we can compute $$ P_Y(A) = \int_A \,\text dP_Y $$ $$ = \int_A \frac{\text dP_Y}{\text dP_X}\,\text dP_X = \int_A \frac{\text dP_Y}{\text dP_X}\frac{\text dP_X}{\text dc}\,\text dc $$ $$ = \sum_{y \in A} \frac{\text dP_Y}{\text dP_X}(y)\frac{\text dP_X}{\text dc}(y) = \sum_{y \in A} \frac{f_Y(y)}{f_X(y)}f_X(y) = \sum_{y \in A} f_Y(y). $$

Thus because $P(X = n) > 0$ for all $n$ in the support of $Y$, we can rescale integration with respect to a Poisson distribution into integration with respect to a binomial distribution, although because everything's discrete it turns out to look like a trivial result.
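This one is concrete enough to verify numerically (a sketch with arbitrarily chosen $\eta$, $n$, $p$; the pmfs are written out by hand to stay self-contained):

```python
import math
import numpy as np

eta, n, p = 3.0, 10, 0.4
ks = np.arange(n + 1)                    # the support of Y

# pmfs = RNDs with respect to counting measure
f_X = np.array([math.exp(-eta) * eta**k / math.factorial(k) for k in ks])
f_Y = np.array([math.comb(n, k) * p**k * (1 - p)**(n - k) for k in ks])

assert np.all(f_X > 0)                   # Poisson mass is positive on Y's support
rnd = f_Y / f_X                          # dP_Y/dP_X on the support of Y

# Integrating the RND against P_X recovers P_Y: here over the whole
# support, so the sum is P_Y(support) = 1
assert np.isclose(np.sum(rnd * f_X), 1.0)
```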


I addressed your more general question but didn't touch on KL divergences. I, at least, find KL divergence much easier to interpret in terms of hypothesis testing, as in @kjetil b halvorsen's answer here. If $P \ll Q$ and there exists a measure $\mu$ that dominates both, then using $\frac{\text dP}{\text dQ} = \frac{\text dP / \text d\mu}{\text dQ / \text d\mu} := p / q$ we recover the familiar form with densities, which I find easier to work with.
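In the discrete case, where the counting measure dominates both, the RND is just the ratio of pmfs and the KL divergence is its log averaged under $P$ (a sketch with made-up pmfs):

```python
import numpy as np

# Two pmfs on the same 3-point space; q > 0 everywhere so P << Q
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

rnd = p / q                          # dP/dQ via the ratio of densities
kl = np.sum(p * np.log(rnd))         # KL(P || Q) = E_P[log dP/dQ]

assert kl >= 0                       # Gibbs' inequality, 0 iff P = Q
```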

jld
  • 18,405
  • 2
  • 52
  • 65
  • 4
    I enjoyed this exposition (as I enjoy all of your contributions), but at bottom it seems predicated on the (repeated) assertion that $0/0$ makes some kind of sense--but it does not. There's something going on with measures that doesn't automatically happen with functions of real values: *you may simply ignore what happens on sets of measure zero.* That's how you avoid having to make sense of $0/0$ in the Radon-Nikodym derivative setting. – whuber Feb 01 '18 at 14:59
  • 1
    @whuber thanks a lot for the comment, that really helps. I've tried to update to address that – jld Feb 01 '18 at 15:43