
I am studying Larry Wasserman's lecture notes on Statistics, which use Casella and Berger as the primary text. I am working through his lecture notes set 2 and got stuck in the derivation of the lemma used in Hoeffding's inequality (pp. 2-3). I reproduce the proof from the notes below, and after the proof I point out where I am stuck.


Lemma

Suppose that $\mathbb{E}(X) = 0$ and that $ a \le X \le b$. Then $\mathbb{E}(e^{tX}) \le e^{t^2 (b-a)^2/8}$.

Proof

Since $a \le X \le b$, we can write $X$ as a convex combination of $a$ and $b$, namely $X = \alpha b + (1 - \alpha) a$ where $\alpha = \frac{X-a}{b-a}$. By convexity of the function $y \to e^{ty}$ we have

$e^{tX} \le \alpha e^{tb} + (1 - \alpha) e^{ta} = \frac{X-a}{b-a} e^{tb} + \frac{b-X}{b-a} e^{ta}$

Take expectations of both sides and use the fact $\mathbb{E}(X) = 0$ to get

$\mathbb{E}(e^{tX}) \le \frac{-a}{b-a} e^{tb} + \frac{b}{b-a} e^{ta} = e^{g(u)}$

where $u = t(b-a)$, $g(u) = -\gamma u + \log(1-\gamma + \gamma e^{u})$ and $\gamma = -a /(b-a)$. Note that $g(0) = g^{'}(0) = 0$. Also $g^{''}(u) \le 1/4$ for all $u > 0 $.

By Taylor's theorem, there is a $\varepsilon \in (0, u)$ such that $g(u) = g(0) + u g^{'}(0) + \frac{u^2}{2} g^{''}(\varepsilon) = \frac{u^2}{2} g^{''}(\varepsilon) \le \frac{u^2}{8} = \frac{t^2(b-a)^2}{8}$

Hence $\mathbb{E}(e^{tX}) \le e^{g(u)} \le e^{\frac{t^2(b-a)^2}{8}}$.
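
(Not from the notes: to convince myself the stated bound is right, I checked it numerically for an arbitrary centered three-point distribution on $[a,b]$; the support points, weights and $t$-grid below are purely illustrative.)

```python
import numpy as np

# An arbitrary centered three-point distribution on [a, b] (E[X] = 0).
a, b = -1.0, 2.0
xs = np.array([-1.0, 0.0, 2.0])   # support points, all in [a, b]
ps = np.array([0.6, 0.1, 0.3])    # probabilities; E[X] = -0.6 + 0 + 0.6 = 0

for t in np.linspace(-3.0, 3.0, 25):
    mgf = ps @ np.exp(t * xs)                 # E[e^{tX}]
    bound = np.exp(t**2 * (b - a)**2 / 8.0)   # e^{t^2 (b-a)^2 / 8}
    assert mgf <= bound + 1e-12
print("E[e^{tX}] <= exp(t^2 (b-a)^2 / 8) on this grid")
```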


I could follow the proof up to
$$\mathbb{E}(e^{tX}) \le \frac{-a}{b-a} e^{tb} + \frac{b}{b-a} e^{ta} = e^{g(u)},$$
but I am unable to figure out how $u$, $g(u)$ and $\gamma$ are derived.

Anand
  • It is interesting that the _maximum_ value of $\text{var}(X)$ is $\sigma_{\max}^2 = (b-a)^2/4$ and thus the result is effectively $$E[e^{tX}] \leq e^{\sigma_{\max}^2t^2/2}$$ which looks far too familiar to be arising out of sheer coincidence. I suspect there may be another, possibly easier, way to derive the result via a probabilistic argument. – Dilip Sarwate Jan 14 '12 at 01:42
  • @DilipSarwate My understanding is that the maximum variance occurs for a uniform random variable $ X \sim \mathcal{U}(a,b)$. The variance of $X$ is $\mathsf{Var}(X) = \frac{(b-a)^2}{12}$. Can you please explain how you got $\frac{(b-a)^2}{4}$? – Anand Jan 14 '12 at 03:59
  • By concentrating the mass on the endpoints... – Elvis Jan 14 '12 at 07:39
  • @DilipSarwate I added a few comments in the proof that may clarify a little bit why the worst case is the max variance. – Elvis Jan 14 '12 at 10:43
  • Can someone add a tag `inequality` or `inequalities`? I don't have enough points to do it. – Anand Jan 14 '12 at 14:26
  • @DilipSarwate - See lemma 1 and exercise 1 here: http://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/ . It seems there is a simpler derivation relying on Jensen's inequality and a Taylor expansion, yet the details of this are unclear to me. Perhaps someone can make sense of it (derivation of (9) to (10) and exercise 1). – Leo Aug 13 '13 at 20:21

2 Answers


I’m not sure I understood your question correctly. I’ll try to answer: try to write $$-\frac{a}{b-a} e^{tb} + \frac{b}{b-a} e^{ta}$$ as a function of $u = t(b-a)$: this is natural, as you want a bound of the form $e^{u^2/8}$.

With experience, you will know that it is better to write it in the form $e^{g(u)}$. Then $$e^{g(u)} = -\frac{a}{b-a} e^{tb} + \frac{b}{b-a} e^{ta}$$ leads to $$\begin{align*} g(u) &= \log\left( -\frac{a}{b-a} e^{tb} + \frac{b}{b-a} e^{ta} \right)\\ &= \log\left( e^{ta} \left( -\frac{a}{b-a} e^{t(b-a)} + \frac{b}{b-a} \right)\right)\\ &= ta + \log\left( \gamma e^u + (1-\gamma) \right)\\ &= -\gamma u + \log\left( \gamma e^u + (1-\gamma) \right), \end{align*}$$ with $\gamma = - {a \over b-a}$; the last step uses $ta = -\gamma u$, which follows from $u = t(b-a)$.
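
If it helps, here is a quick numerical check of this identity (the values of $a$, $b$ and $t$ are arbitrary):

```python
import numpy as np

# Arbitrary values of a < 0 < b and t, just to check the algebra.
a, b, t = -0.7, 1.3, 0.9
u = t * (b - a)
gamma = -a / (b - a)

lhs = -a / (b - a) * np.exp(t * b) + b / (b - a) * np.exp(t * a)
g_u = -gamma * u + np.log(1 - gamma + gamma * np.exp(u))

print(lhs, np.exp(g_u))   # both print the same number, up to rounding
```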

Is that the kind of thing you were asking for?

Edit: a few comments on the proof

  1. The first trick deserves to be looked at carefully: if $\phi$ is a convex function, and $a\le X\le b$ is a centered random variable, then $$\mathbb{E}(\phi(X)) \le -{a\over b-a} \phi(b) + {b \over b-a} \phi(a) = \mathbb{E}(\phi(X_0)),$$ where $X_0$ is the discrete variable defined by $$\begin{align*} \mathbb P(X_0=a) &= {b \over b-a}\\ \mathbb P(X_0=b) &= -{a \over b-a}.\end{align*}$$ As a consequence, you get that $X_0$ is the centered variable with support in $[a,b]$ which has the highest variance: $$\mathsf{Var} (X) = \mathbb{E}(X^2) \le \mathbb{E}(X_0^2) = {ba^2 - ab^2 \over b-a } = - ab.$$ Note that for a fixed support width $b-a$, this is at most ${(b-a)^2\over 4}$, as Dilip says in the comments: indeed $-ab \le {(b-a)^2\over 4}$ because $(b-a)^2 + 4ab = (a+b)^2 \ge 0$; the bound is attained for $a=-b$.
  2. Now turn to our problem. Why is it possible to get a bound depending only on $u = t(b-a)$? Intuitively, it is just a matter of rescaling $X$: if you have a bound $\mathbb{E}\left( e^{tX} \right) \le s(t)$ for the case $b-a=1$, then the general bound can be obtained by taking $s( t(b-a) )$. Now think of the set of centered variables with support of width 1: there isn’t that much freedom, so a bound like $s(t)$ should exist.

    Another approach is to say simply that, by the above lemma on $\mathbb{E}(\phi(X))$, more generally $\mathbb{E}(\phi(tX)) \le \mathbb{E}(\phi(tX_0))$, which depends only on $u$ and $\gamma$: if you fix $u = u_0 = t_0 (b_0 - a_0)$ and $\gamma = \gamma_0 = - {a_0 \over b_0-a_0}$, and let $t, a, b$ vary, there is only one degree of freedom, and $t = {t_0 \over \alpha}$, $a = \alpha a_0$, $b = \alpha b_0$. We get $$-{a\over b-a} \phi(tb) + {b \over b-a} \phi(ta) = -{a_0\over b_0-a_0} \phi(t_0 b_0) + {b_0 \over b_0-a_0} \phi(t_0 a_0).$$ You just have to find a bound involving only $u$.

  3. Now that we are convinced it can be done, it must be much easier! You don’t necessarily think of $g$ to begin with. The point is that you must write everything as a function of $u$ and $\gamma$.

    First note that $\gamma = -{a\over b-a}$, $1 -\gamma = {b\over b-a}$, $at = -\gamma u$ and $bt = (1-\gamma)u$. Then $$\begin{align*} \mathbb{E}(\phi(tX)) \le &-{a\over b-a} \phi(tb) + {b \over b-a} \phi(ta) \\ = & \gamma \phi((1-\gamma)u) + (1-\gamma) \phi(-\gamma u) \end{align*}$$

    Now we are in the particular case $\phi=\exp$... I think you can finish (a quick numerical check of the resulting bound is sketched just after this list).
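
In the case $\phi=\exp$, the remaining claim is that $\log\big(\gamma e^{(1-\gamma)u} + (1-\gamma)e^{-\gamma u}\big) \le u^2/8$ for every $\gamma \in (0,1)$. A quick grid check of that inequality (the ranges of $\gamma$ and $u$ below are arbitrary):

```python
import numpy as np

# Grid check that log(gamma*e^{(1-gamma)u} + (1-gamma)*e^{-gamma*u}) <= u^2/8,
# i.e. g(u) <= u^2/8; the gamma and u ranges are arbitrary.
gammas = np.linspace(0.01, 0.99, 99)
us = np.linspace(-10.0, 10.0, 401)
G, U = np.meshgrid(gammas, us)

g = -G * U + np.log(1 - G + G * np.exp(U))   # log of the two-point expectation
assert np.all(g <= U**2 / 8 + 1e-12)
print("g(u) <= u^2/8 everywhere on the grid")
```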

I hope I clarified it a little bit.

Elvis
  • That's exactly what I was looking for. Thanks a lot. – Anand Jan 13 '12 at 23:32
  • @Anand I know it’s hard-to-follow advice; however, I think you shouldn’t start by focusing on technical details but rather try to get *why* such a bound can exist... then the proof should appear easier. I tried to show you the *why* in the second part, added this morning (you need to sleep on a question like this – at least I need to). I think it’s terrible how this kind of intuition doesn’t appear in most textbooks... even if you get the technical part, as long as you don’t have the ideas everything looks magic. Thank you and CrossV for giving me the opportunity to think about this in detail! – Elvis Jan 14 '12 at 13:20
  • Wow! +1 for the edit. Thanks. But wouldn't it be nice if it were possible to get something like $$E[e^{tX}] \leq e^{E[t^2X^2/2]} = e^{(t^2/2)E[X^2]} = e^{(t^2/2)\text{var}(X)} \leq e^{t^2\sigma_{\max}^2/2}?$$ – Dilip Sarwate Jan 14 '12 at 13:40
  • @Elvis Thanks for the advice and for taking the time to write down the intuitive part. I need to spend some time to understand this! – Anand Jan 14 '12 at 14:30
  • @Elvis Talking about intuition, I want to clarify my understanding. To get sharper bounds one needs higher moments. Markov uses the first moment, Chebyshev the second moment, and Hoeffding uses the mgf. Is this correct? If someone can expand and clarify this part it would be great. – Anand Jan 14 '12 at 14:34
  • @DilipSarwate This is a stronger result. But it looks likely true... I need a paper and a pen (a paper, a pen! My kingdom for...) – Elvis Jan 14 '12 at 14:35
  • @Anand I think this is it, yes. I need to improve my understanding of all this myself... – Elvis Jan 14 '12 at 14:37
  • @DilipSarwate: Sorry! I got sloppy near the end of my previous comment. It was not correct as stated. I have removed it for the time being. – cardinal Jan 14 '12 at 19:09
  • @DilipSarwate I think this is false... I’ll add a few lines to the answer. – Elvis Jan 14 '12 at 22:23
  • I’ll do it tomorrow... – Elvis Jan 14 '12 at 22:38
  • @DilipSarwate I don’t think it is useful to add it in the answer (which is now too long to make editing comfortable: on my laptop the automatic refreshing of the preview is very slow and this is really painful). I will just give you the arguments here. When you study the function $g(u)$ defined in Wasserman’s proof, you have to show that $$g''(u) = {1\over 4} \times {4\times (1-\gamma) \times \gamma e^u \over \left( (1-\gamma) + \gamma e^u \right)^2 } \le {1\over 4}.$$ This bound is sharp for small values of $u$, independently of $\gamma$. So $$ g''(u) \le \gamma(1-\gamma)$$ doesn’t hold. – Elvis Jan 15 '12 at 09:18
  • I hope this is enough; from this remark it is easy to construct a counter-example (with $\gamma \ne {1\over 2}$). I went a little too fast on the sharpness of the bound: it is the bound on $g(u)$ which is sharp for small $u$. The bound on $g''(u)$ cannot be improved; it is attained for $\gamma e^u = (1-\gamma)$. – Elvis Jan 15 '12 at 09:22

As @DilipSarwate noticed, we effectively have

\begin{equation} E[e^{tX}] \leq e^{\sigma_{\max}^2t^2/2}, \end{equation}

which seems to be more than a mere coincidence and leads us to think that there is a probabilistic argument behind the result.

And it is indeed the case. The gist of the argument goes as follows:

We want to prove that if $X$ is a random variable bounded in $[a,b]$ with $\mathbb{E}[X]=0$, then it is a $\frac{(b-a)^2}{4}$-sub-Gaussian random variable.

To do so, we first consider the logarithmic moment-generating function $g(t) = \log\mathbb{E}[e^{tX}]$. This function is well defined on $\mathbb{R}$, since $e^{tX} > 0$ for all $t$ and therefore $\mathbb{E}[e^{tX}]>0$. Moreover, $g$ is differentiable, because $\mathbb{E}[e^{tX}] >0$ and $t\mapsto \mathbb{E}[e^{tX}]$ is differentiable (indeed $|\mathbb{E}[X e^{tX}]| \leqslant (|a| + |b|)\,e^{(|a|+|b|)|t|} < \infty$ for all $t \in \mathbb{R}$).

We then have: \begin{align*} \forall t \in \mathbb{R},\: g'(t) = \frac{\mathbb{E}[X e^{tX}]}{\mathbb{E}[e^{tX}]}. \end{align*}

Showing with similar arguments that $g'$ is in turn differentiable, we get: \begin{equation} \begin{split} \forall t \in \mathbb{R},\: g''(t) & = \frac{\mathbb{E}[X^2 e^{tX}]\,\mathbb{E}[e^{tX}] - \mathbb{E}[X e^{tX}]^2}{\mathbb{E}[e^{tX}]^2} \\ & = \mathbb{E}\left[X^2\frac{e^{tX}}{\mathbb{E}[e^{tX}]}\right] - \mathbb{E}\left[X\frac{e^{tX}}{\mathbb{E}[e^{tX}]}\right]^2. \end{split} \end{equation}

We can now notice that $g''(t)$ is the variance of a random variable $Y$ whose law is the exponentially tilted measure $dP_Y(y) = \frac{e^{ty}}{\mathbb{E}[e^{tX}]}\,dP_X(y)$. Since the law of $Y$ is absolutely continuous with respect to that of $X$, the support of $Y$ is contained in $[a,b]$. We also know that for any random variable $U$ with values in $[a,b]$, $\mathrm{Var}(U) \leqslant \mathbb{E}[(U-c)^2]$ for every $c \in \mathbb{R}$; taking $c=\frac{a+b}{2}$ yields $\mathrm{Var}(U) \leqslant \mathbb{E}\big[\big(U - \tfrac{a+b}{2}\big)^2\big] \leqslant \big(\tfrac{b-a}{2}\big)^2$.

We thus get $g''(t) \leqslant \big(\tfrac{b-a}{2}\big)^2$ for all $t \in \mathbb{R}$. We can also remark that $g(0) = 0$ and $g'(0) = \mathbb{E}[X] = 0$.

\begin{equation} \begin{split} \forall t \geqslant 0,\: g(t) & = g(t) - g(0) = \int_{0}^{t} g'(s) \,ds = \int_{0}^{t}\!\int_{0}^{s} g''(u) \,du\,ds \\ & \leqslant \int_{0}^{t}\!\int_{0}^{s} \Big(\frac{b-a}{2}\Big)^2 du\,ds = \Big(\frac{b-a}{2}\Big)^2 \int_{0}^{t} s \,ds = \frac{(b-a)^2}{4}\,\frac{t^{2}}{2} = \frac{(b-a)^2t^{2}}{8}. \end{split} \end{equation} The same bound holds for $t < 0$, since $|g'(s)| \leqslant \frac{(b-a)^2}{4}|s|$ for every $s$.

Finally, applying the exponential to both sides of this inequality gives the desired result $\mathbb{E}[e^{tX}] \leqslant e^{(b-a)^2 t^2/8}$.
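
For what it's worth, the key step $g''(t) \leqslant \big(\tfrac{b-a}{2}\big)^2$ can be checked numerically by computing the variance of the tilted law for a small discrete example (the distribution and the grid of $t$ values below are arbitrary):

```python
import numpy as np

# An arbitrary centered discrete X supported on [a, b] (E[X] = 0).
a, b = -1.0, 2.0
xs = np.array([-1.0, 0.0, 2.0])
ps = np.array([0.6, 0.1, 0.3])

for t in np.linspace(-4.0, 4.0, 33):
    w = ps * np.exp(t * xs)
    w /= w.sum()                            # the exponentially tilted law of X
    g_second = w @ xs**2 - (w @ xs)**2      # g''(t) = variance under the tilted law
    assert g_second <= (b - a)**2 / 4 + 1e-12
print("g''(t) <= ((b-a)/2)^2 on this grid")
```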

JBD