If $p(x)$ is a probability distribution with non-zero values on $[0,+\infty)$, for what type(s) of $p(x)$ does there exist a constant $c\gt 0$ such that $\int_0^{\infty}p(x)\log\frac{p(x)}{(1+\epsilon)\,p(x(1+\epsilon))}\,dx \leq c\,\epsilon^2$ for all $0\lt\epsilon\lt 1$?

The left-hand side of the inequality is the Kullback–Leibler divergence between the distribution $p(x)$ and a compressed version of it, $(1+\epsilon)\,p(x(1+\epsilon))$. I have found that the inequality holds for the Exponential, Gamma, and Weibull distributions, and I am interested to know whether it holds for a larger class of probability distributions.
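
For a quick numerical sanity check (my own sketch in Python, assuming numpy and scipy are available; the printed constants are illustrative, not part of the claim), one can estimate the left-hand side for these three families and watch its ratio to $\epsilon^2$, which appears to stay bounded:

```python
# Hypothetical sketch: numerically estimate the KL integral and its ratio to eps^2.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def kl_compressed(pdf, eps):
    """Integral of p(x) * log[ p(x) / ((1+eps) * p(x*(1+eps))) ] over (0, inf)."""
    def integrand(x):
        return pdf(x) * np.log(pdf(x) / ((1 + eps) * pdf(x * (1 + eps))))
    return quad(integrand, 0, np.inf, limit=200)[0]

densities = {
    "Exponential": stats.expon.pdf,
    "Gamma(shape=3)": lambda x: stats.gamma.pdf(x, a=3),
    "Weibull(shape=2)": lambda x: stats.weibull_min.pdf(x, c=2),
}
for name, pdf in densities.items():
    ratios = [kl_compressed(pdf, eps) / eps**2 for eps in (0.5, 0.1, 0.01)]
    print(name, np.round(ratios, 4))  # ratios stay bounded for these families
```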

Any idea what that inequality means?

Sus20200
  • Since $\epsilon$ is positive that would be compressed (in the $x$-direction) rather than stretched. – Glen_b Dec 10 '15 at 23:23
  • This question is ambiguous: what are your quantifiers? Do you want this inequality to hold for *all* $\epsilon$, *at least one* $\epsilon$, or something else? Is $c$ given *a priori* or do you mean there should *exist at least one such value* of $c$? And since you mention *classes* of probability distributions, by "$p(x)$" do you mean *one specific* distribution or do you perhaps mean a parametric family of them? – whuber Dec 11 '15 at 15:27
  • @whuber Thanks for your comments. I have corrected my problem statement to clarify those issues. I mean: for what $p(x)$ does the above inequality hold? The answer might either introduce a parametric family of distributions or propose a differential equation for $p(x)$ that suffices to give the desired inequality. – Sus20200 Dec 11 '15 at 22:09
  • Wouldn't this inequality work for any $p(x)$ which is continuous and has infinite support? You are computing the KL divergence inside a parametric family ($\epsilon \rightarrow p(x(1+\epsilon))$). If the KL is differentiable at $0$, then its derivative there is $0$. Taking $C$ to be the maximum of the curvature of the KL (for $\epsilon \in [0,1]$), we have the bound. With additional work, it might be possible to bound $C$ from properties of $p$. – Guillaume Dehaene Dec 17 '15 at 10:07
  • Thanks a lot. That doesn't work for the Pareto distribution, for instance, whose density is infinite at zero. I guess the distribution should be finite and non-zero at the origin. Can you please give me more details? – Sus20200 Dec 17 '15 at 13:55
  • It can be infinite as long as $L = \lim_{x \rightarrow 0} p(x)x = 0$. The first-order expansion of the KL is $L \epsilon + O(\epsilon^2)$. – Arthur B. Dec 17 '15 at 23:21
  • @ArthurB. Can you please give more details on that expansion? I am not sure why the expansion under the integral is valid (we might need the dominated convergence theorem). – Sus20200 Dec 21 '15 at 19:53
  • You do need to be able to take the derivative under the integral to make this argument. Once you do, you have $L = \lim_{x\rightarrow \infty} p(x) x - \lim_{x\rightarrow 0} p(x) x$. – Arthur B. Dec 22 '15 at 16:37
  • @ArthurB. How can I make sure I can take the derivative under the integral? – Sus20200 Dec 23 '15 at 15:26
  • By refraining from crafting a pathological case where you can't... Otherwise, dominated convergence is sufficient, as you point out. – Arthur B. Dec 23 '15 at 15:27

1 Answer

Preliminaries

Write

$$\mathcal{I}_p(\epsilon) = \int_0^\infty p(x) \log\left(\frac{p(x)}{(1+\epsilon)p(x(1+\epsilon))}\right)\, dx.$$

The logarithms and the relationship between $p(x)$ and $p(x(1+\epsilon))$ suggest expressing both $p$ and its argument as exponentials. To that end, define

$$q(y) = \log(p(e^y))$$

for all real $y$ for which the right-hand side is defined, and set $q(y) = -\infty$ wherever $p(e^y)=0$. Notice that the change of variables $x=e^y$ entails $dx=e^y dy$ and (taking $p$ to be the density of a distribution) that the Law of Total Probability can thereby be expressed as

$$1 = \int_0^\infty p(x)dx = \int_\mathbb{R} e^{q(y)+y} dy.\tag{1}$$
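
As a quick numerical illustration (a sketch assuming Python with scipy; the Exponential density is my example), $(1)$ can be verified directly under this substitution:

```python
# Sanity check of (1) for the Exponential density p(x) = e^{-x} (my example):
# with q(y) = log p(e^y) = -e^y, the integral of e^{q(y)+y} over R should be 1.
import numpy as np
from scipy.integrate import quad

total = quad(lambda y: np.exp(-np.exp(y) + y), -np.inf, np.inf)[0]
print(total)  # ~1.0, as (1) requires
```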

Let us assume $e^{q(y)+y}\to 0$ when $y\to\pm\infty$. This rules out probability distributions $p$ with infinitely many spikes in density near $0$ or $\infty$. In particular, if the tails of $p$ are eventually monotonic, $(1)$ implies this assumption, showing it is not a severe one.

To make working with the logarithms easier, also observe that

$$1+\epsilon = e^\epsilon + O(\epsilon^2).$$

Because the following calculations will be performed up to multiples of $\epsilon^2$, define

$$\delta = \log(1+\epsilon).$$

We might as well replace $1+\epsilon$ by $e^\delta$, with $\delta=0$ corresponding to $\epsilon=0$ and positive $\delta$ corresponding to positive $\epsilon$.

Analysis

One obvious way in which the inequality can fail would be for the integral $\mathcal{I}_p(\epsilon)$ to diverge for some $\epsilon \in (0, 1]$. This would happen if, for instance, there were any proper interval $[u, v]$ of positive numbers, no matter how small, on which $p$ were identically zero while $p$ were not (almost everywhere) zero on the interval $[u/(1+\epsilon), v/(1+\epsilon)]$. That would cause the integrand to be infinite with positive probability.
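
To see this failure mode concretely, here is a small illustration (my own construction, not essential to the argument): take $e^{-x}$, delete the interval $(1,2)$, renormalize, and observe that the integrand is infinite at points mapped into the gap.

```python
# Hypothetical counterexample: e^{-x} with the interval (1, 2) cut out, renormalized.
import numpy as np

Z = 1.0 - (np.exp(-1) - np.exp(-2))          # mass removed from (1, 2)

def p(x):
    return np.exp(-x) / Z if (0 < x < 1 or x > 2) else 0.0

eps, x = 0.5, 0.9                             # x*(1+eps) = 1.35 lies in the gap
with np.errstate(divide="ignore"):
    integrand = p(x) * np.log(p(x) / ((1 + eps) * p(x * (1 + eps))))
print(integrand)  # inf: the integrand blows up, so I_p(eps) diverges
```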

Because the question is unspecific concerning the nature of $p$, we could get bogged down in technical issues concerning how smooth $p$ might be. Let's avoid such issues, still hoping to gain some insight, by assuming that $q$ everywhere has as many derivatives as we might care to use. (Two will suffice if $q^{\prime\prime}$ is continuous.) Because that guarantees $q$ remains bounded on any bounded set, it implies that $p(x)$ is never zero when $x \gt 0$.

Note that the question really concerns the behavior of $\mathcal{I}_p(\epsilon)$ as $\epsilon$ approaches zero from above. Since this integral is a continuous function of $\epsilon$ on the interval $(0,1]$, it attains some maximum $M_p(a)$ when $\epsilon$ is restricted to any positive interval $[a,1]$, enabling us to choose $c = M_p(a)/a^2$, because obviously $$c\epsilon^2 = M_p(a) \left(\frac{\epsilon}{a}\right)^2 \ge M_p(a) \ge \mathcal{I}_p(\epsilon)$$

for all $\epsilon \in [a,1]$, making the inequality work there. This is why we need only be concerned with the calculation modulo $\epsilon^2$ as $\epsilon \to 0^+$.
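
To illustrate this reduction numerically (a sketch under the foregoing assumptions): for the Exponential density a direct computation gives $\mathcal{I}_p(\epsilon) = \epsilon - \log(1+\epsilon)$ exactly, so $M_p(a)$ and $c = M_p(a)/a^2$ can be computed and the bound checked on $[a,1]$.

```python
# Sketch of the reduction for the Exponential density, where
# I_p(eps) = eps - log(1 + eps) in closed form (a direct computation).
import numpy as np
from scipy.optimize import minimize_scalar

I = lambda eps: eps - np.log1p(eps)
a = 0.1
M = -minimize_scalar(lambda e: -I(e), bounds=(a, 1), method="bounded").fun
c = M / a**2                                  # the choice c = M_p(a) / a^2
eps = np.linspace(a, 1, 1000)
print(np.all(I(eps) <= c * eps**2))           # True: the bound holds on [a, 1]
```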

Solution

Using the changes of variable from $x$ to $y$, from $p$ to $q$, and $\epsilon$ to $\delta$, let's calculate $\mathcal{I}_p(\epsilon)$ through second order in $\epsilon$ (or $\delta$) in the hope of achieving a simplification. To that end define

$$\mathcal{R}(y, \delta) \delta^2 = q(y+\delta) - q(y) - \delta q^\prime(y)$$

to be the order-$2$ remainder in the Taylor expansion of $q$ around $y$.

$$\eqalign{ \mathcal{I}_p(\epsilon) &= \int_\mathbb{R}e^{q(y) + y} \left(q(y) - q(y+\delta) - \delta\right)\, dy \\ &=-\int_\mathbb{R}e^{q(y) + y} \left(\delta + \delta q^\prime(y) + \mathcal{R}(y, \delta) \delta^2 \right)\, dy \\ &= -\delta\int_\mathbb{R}e^{q(y) + y} \left(1+q^\prime(y)\right)\, dy -\delta^2\int_\mathbb{R}e^{q(y) + y} \mathcal{R}(y, \delta)\, dy. }$$

The integrand of the left-hand integral is precisely the derivative of $e^{q(y)+y}$, so that integral equals $\lim_{y\to\infty} e^{q(y)+y} - \lim_{y\to-\infty} e^{q(y)+y} = 0$ by the assumption following $(1)$. Changing variables back to $x=e^y$ in the right-hand integral gives

$$\mathcal{I}_p(\epsilon) = - \delta^2 \int_0^\infty p(x)\, \mathcal{R}(\log(x), \delta)\, dx = -\delta^2\, \mathbb{E}_p\left(\mathcal{R}(\log(x), \delta)\right).$$
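
This identity is easy to check numerically (a sketch, scipy assumed): for the Exponential density $q(y) = -e^y$, and both sides evaluate to $\epsilon - \log(1+\epsilon)$.

```python
# Check of I_p(eps) = -delta^2 * E_p[R(log x, delta)] for p(x) = e^{-x},
# where q(y) = -e^y (my verification; both sides equal eps - log(1 + eps)).
import numpy as np
from scipy.integrate import quad

def R(y, d):
    # Order-2 Taylor remainder of q(y) = -e^y: (q(y+d) - q(y) - d*q'(y)) / d^2
    return (-np.exp(y + d) + np.exp(y) + d * np.exp(y)) / d**2

eps = 0.2
d = np.log1p(eps)                             # delta = log(1 + eps)
lhs = quad(lambda x: np.exp(-x) * (eps * x - d), 0, np.inf)[0]
rhs = -d**2 * quad(lambda x: np.exp(-x) * R(np.log(x), d), 0, np.inf)[0]
print(lhs, rhs)                               # both ~0.01768 = eps - log(1 + eps)
```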

The inequality holds (under our various technical assumptions) if and only if the coefficient of $\delta^2$ on the right-hand side, $\mathbb{E}_p\left(\mathcal{R}(\log(x), \delta)\right)$, remains bounded as $\delta$ decreases to $0$.

Interpretation

This is a good point to stop, because it appears to uncover the essential issue: $\mathcal{I}_p(\epsilon)$ is bounded by a quadratic function of $\epsilon$ precisely when the quadratic error in the Taylor expansion of $q$ doesn't explode (relative to the distribution) as $y$ approaches $\pm\infty$.

Let's check some of the cases mentioned in the question: the Exponential and Gamma distributions. (The Exponential is a special case of the Gamma.) We never have to worry about scale parameters, because they merely change the units of measurement. Only non-scale parameters matter.

Here, because $p(x) = x^k e^{-x}/\Gamma(k+1)$ for $k \gt -1$, $$q(y) = -e^y + k y - \log\Gamma(k+1).$$ The Taylor expansion around an arbitrary $y$ is $$\text{Constant} + (k-e^y)\delta - \frac{e^y}{2}\delta^2 + \cdots.$$ Taylor's Theorem with Remainder implies $|\mathcal{R}(\log(x),\delta)|$ is dominated by $e^{y+\delta}/2 \lt x$ for sufficiently small $\delta$ (namely, $\delta \lt \log 2$). Since the expectation of $x$ is finite, the inequality holds for Gamma distributions.
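
A numerical confirmation (my sketch, scipy assumed): for the Gamma family the logarithm of the density ratio simplifies to $\epsilon x - (k+1)\log(1+\epsilon)$, whence $\mathcal{I}_p(\epsilon) = (k+1)\left(\epsilon - \log(1+\epsilon)\right) \le (k+1)\epsilon^2/2$; the following code checks both facts.

```python
# Check for the Gamma family: the log-ratio reduces to eps*x - (k+1)*log(1+eps),
# hence I_p(eps) = (k+1) * (eps - log(1+eps)) <= (k+1) * eps^2 / 2 exactly.
import numpy as np
from scipy import stats
from scipy.integrate import quad

for k in (0.0, 0.5, 3.0):                     # density x^k e^{-x} / Gamma(k+1)
    pdf = lambda x, a=k + 1: stats.gamma.pdf(x, a=a)
    for eps in (0.5, 0.1, 0.01):
        I = quad(lambda x: pdf(x) * np.log(
            pdf(x) / ((1 + eps) * pdf(x * (1 + eps)))), 0, np.inf)[0]
        assert abs(I - (k + 1) * (eps - np.log1p(eps))) < 1e-6
        assert I <= (k + 1) * eps**2 / 2
print("Gamma family checks passed")
```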

Similar calculations imply the inequality for Weibull distributions, Half-Normal distributions, Lognormal distributions, etc. In fact, to obtain counterexamples we would need to violate at least one assumption, forcing us to look at distributions where $p$ vanishes on some interval, or is not continuously twice differentiable, or has infinitely many modes. These are easy tests to apply to any family of distributions commonly used in statistical modeling.

whuber